## How to decide if a language is In R or RE or CoRE - turing-machines

### Deciding finite prefix language of a real number

```why is the language of finite prefixes of the number pi decidable by a TM whereas it is false to say there is a TM for any real number that which decides the finite prefixes of that given number?
```
```Why is the language of finite prefixes of the number pi decidable by a TM
There is an effective computational procedure to print out finite prefixes of the digits of pi. The Maclaurin series for sin(x) is x - x^3/3! + x^5/5! - ... Furthermore, we know that sin(pi/2) = 1, so we can set 1 = x - x^3/3! + x^5/5! - ..., start somewhere close (like x = 1.5) and find the largest value of x that increased over its predecessor. Then, multiply this by 2 and keep the first n digits to get a prefix of length n. For instance:
f(1.50) < f(1.51) < f(1.52) < f(1.53) < f(1.54) < f(1.55) < f(1.56) < f(1.57)
f(1.59) < f(1.58) < f(1.57)
This tells us that x = 1.57 is the closest value to pi/2 and is either a little bigger or a little smaller than we really need. We can tell by checking the Maclaurin series for cos(x): we see that cos(1.57) converges to a positive number, so we know we are on the largest number with n digits that is less than pi/2. Keep the computation at least one level lower than you need to return digits of pi and everything will turn out fine.
whereas it is false to say there is a TM for any real number that which decides the finite prefixes of that given number
Here is a real number for you: the real number between 0 and 1 whose decimal representation as the nth digit set to 1 if and only if the nth Turing machine (determined by lexicographical ordering of the UTM encoding of all TMs) accepts the empty language. This is a well-defined real number - each of its decimal digits is either 0 or 1 - and yet if we had a TM that could find any finite prefix of this real number, we could answer the question "does this TM accept the empty language?" for any TM by:
encoding the TM in the UTM encoding
enumerating all strings in lexicographical order and counting how many valid UTM encodings of TMs we find
halting the computation when we count the TM we want the answer for
asking for a prefix whose length is equal to the count we just got
checking the last digit of the prefix to see whether our TM accepts the empty language or not
This is a contradiction since it is undecidable whether a TM accepts the empty language or not. Therefore, our assumption that we could compute finite prefixes of this real number was incorrect.
For any undecidable problem involving TMs (Rice's theorem and diagonalization arguments guarantee that most languages of UTM encodings of TMs are) you get a unique real number which cannot be computed. Indeed, the computable real numbers are countable, just like Turing machines, but unlike languages and real numbers, which are uncountable.```

### Check whether 2 languages are Turing recognizable or co-Turing recognizable

```I have these 2 languages
A = {<M> | M is a TM and L(M) contains exactly n strings }
B = {<N> | N is a TM and L(N) contains more than n strings }
I believe that these 2 are undecidable, but I am not sure whether they are Turing recognizable or co-Turing recognizable.
```
```B is Turing recognizable since we can interleave executions of M on all possible input tapes. If more than n of the running instances of M ever halt-accept, then halt-accept.
We know that A cannot be Turing-recognizable because, if it were, the language B' = {<N> | N is a TM and L(N) contains no more than n strings } would be Turing-recognizable (we could interleave the execution of recognizers for 1, 2, ..., n and halt-accept if any of those did). That would imply both B and B' were decidable since B' must be co-Turing recognizable.
If A were co-Turing recognizable, we could recognize machines that accept a number of strings different than n. In particular, let n = 1. We can run the recognizer for machines whose languages contain other than n strings on a TM constructed to accept L(M) \ {w} for every possible string w. At each stage, we run one step of all existing machines, then construct a new machine, and repeat, thus interleaving executions and ensuring all TMs eventually get to run arbitrarily many steps.
Assuming |L(M)| = 1, exactly one of these TMs will halt-accept (the one that removes the only string in L(M)) and the rest will either halt-reject or run forever. Therefore, a recognizer for |L(M)| != 1 can be used to construct a recognizer for |L(M)| = 1. This generalizes to |L(M)| != k and |L(M)| = k by subtracting all possible sets of k input strings.
Therefore, if A were co-Turing recognizable, it would also be Turing recognizable, thus decidable. We already know that's wrong, so we must conclude that A is not co-Turing recognizable; nor is it Turing recognizable.```

### Machine learning preprocess strings to numbers based on string similarity

```I need to preprocess data into numbers in order to be able to apply ML algorithms in a dataset, but there is this feature that is almost Tree structured with strings which I have no idea how to transform. Here goes an example:
Feature -> Value I would like to transform to (example):
X Y Z foo -> 0.5
X Y Z bar -> 0.501
A B C foo -> 4.1
W B C foo -> 5
Essentially the string would transform into a unique real number, where this number would be really close to other numbers if their strings were almost identical, giving greater weights to the first words that come up first on the String.
My question, is there an already existing algorithm to solve this?
```
```First of all, I do not know any algorithm to solve this. But I have an idea (I know this is not "an answer" but I lack the reputation to add this as a comment).
Transform every string by repeating every character proportional to their position from the end. For example, "Foo" will become "FFFooo" and "abcd" "aaaabbbccd". Then use the edit-distance over every pair to build a distance matrix, M.
Now it is an optimization problem. Start with random solution (a random real to every word), then calculate the distance matrix M' of your solution and minimize some metric (squared error) between M and M'.
```
```It seems to me from your example that you are trying to find the similarity between two document of texts. Cosine similarity is the most widely used distance measure in this context.
For instance, let us compare "The boy eats the apple" and "The girl eats the pear"
First, create a frequency matrix, where entry ij contains the number of times term j is contained in document i.
"the" "boy" "girl" "eats" "apple" "pear"
(sentence 1) 2 1 0 1 1 0
(sentence 2) 2 0 1 1 0 1
The cosine similarity can thus be calculated with
From Principles of Data Mining:
This is the cosine of the anvgle between the two vectors (equivalently, their inner product after each has been normalized to have unit length) and, thus, reflects similarity in terms of the relative distribution of their term components.```

### Selecting parameters for string hashing

```I was recently reading an article on string hashing. We can hash a string by converting a string into a polynomial.
H(s1s2s3 ...sn) = (s1 + s2*p + s3*(p^2) + ··· + sn*(p^n−1)) mod M.
What are the constraints on p and M so that the probability of collision decreases?
A good requirement for a hash function on strings is that it should be difficult to find a
pair of different strings, preferably of the same length n, that have equal fingerprints. This
excludes the choice of M < n. Indeed, in this case at some point the powers of p corresponding
to respective symbols of the string start to repeat.
Similarly, if gcd(M, p) > 1 then powers of p modulo M may repeat for
exponents smaller than n. The safest choice is to set p as one of
the generators of the group U(ZM) – the group of all integers
relatively prime to M under multiplication modulo M.
I am not able to understand the above constraints. How selecting M < n and gcd(M,p) > 1 increases collision? Can somebody explain these two with some examples? I just need a basic understanding of these.
In addition, if anyone can focus on upper and lower bounds of M, it will be more than enough.
The above facts has been taken from the following article string hashing mit.
```
```The "correct" answers to these questions involve some amount of number theory, but it can often be instructive to look at some extreme cases to see why the constraints might be useful.
For example, let's look at why we want M ≥ n. As an extreme case, let's pick M = 2 and n = 4. Then look at the numbers p0 mod 2, p1 mod 2, p2 mod 2, and p3 mod 2. Because there are four numbers here and only two possible remainders, by the pigeonhole principle we know that at least two of these numbers must be equal. Let's assume, for simplicity, that p0 and p1 are the same. This means that the hash function will return the same hash code for any two strings whose first two characters have been swapped, since those characters are multiplied by the same amount, which isn't a desirable property of a hash function. More generally, the reason why we want M ≥ n is so that the values p0, p1, ..., pn-1 at least have the possibility of being distinct. If M < n, there will just be too many powers of p for them to all be unique.
Now, let's think about why we want gcd(M, p) = 1. As an extreme case, suppose we pick p such that gcd(M, p) = M (that is, we pick p = M). Then
s0p0 + s1p1 + s2p2 + ... + sn-1pn-1 (mod M)
= s0M0 + s1M1 + s2M2 + ... + sn-1Mn-1 (mod M)
= s0
Oops, that's no good - that makes our hash code exactly equal to the first character of the string. This means that if p isn't coprime with M (that is, if gcd(M, p) ≠ 1), you run the risk of certain characters being "modded out" of the hash code, increasing the collision probability.
```
```How selecting M < n and gcd(M,p) > 1 increases collision?
In your hash function formula, M might reasonably be used to restrict the hash result to a specific bit-width: e.g. M=216 for a 16-bit hash, M=232 for a 32-bit hash, M=2^64 for a 64-bit hash. Usually, a mod/% operation is not actually needed in an implementation, as using the desired size of unsigned integer for the hash calculation inherently performs that function.
I don't recommend it, but sometimes you do see people describing hash functions that are so exclusively coupled to the size of a specific hash table that they mod the results directly to the table size.
The text you quote from says:
A good requirement for a hash function on strings is that it should be difficult to find a pair of different strings, preferably of the same length n, that have equal fingerprints. This excludes the choice of M < n.
This seems a little silly in three separate regards. Firstly, it implies that hashing a long passage of text requires a massively long hash value, when practically it's the number of distinct passages of text you need to hash that's best considered when selecting M.
More specifically, if you have V distinct values to hash with a good general purpose hash function, you'll get dramatically less collisions of the hash values if your hash function produces at least V2 distinct hash values. For example, if you are hashing 1000 values (~210), you want M to be at least 1 million (i.e. at least 2*10 = 20-bit hash values, which is fine to round up to 32-bit but ideally don't settle for 16-bit). Read up on the Birthday Problem for related insights.
Secondly, given n is the number of characters, the number of potential values (i.e. distinct inputs) is the number of distinct values any specific character can take, raised to the power n. The former is likely somewhere from 26 to 256 values, depending on whether the hash supports only letters, or say alphanumeric input, or standard- vs. extended-ASCII and control characters etc., or even more for Unicode. The way "excludes the choice of M < n" implies any relevant linear relationship between M and n is bogus; if anything, it's as M drops below the number of distinct potential input values that it increasingly promotes collisions, but again it's the actual number of distinct inputs that tends to matter much, much more.
Thirdly, "preferably of the same length n" - why's that important? As far as I can see, it's not.
I've nothing to add to templatetypedef's discussion on gcd.```

### Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

```I was looking through a programming question, when the following question suddenly seemed related.
How do you convert a string to another string using as few swaps as follows. The strings are guaranteed to be interconvertible (they have the same set of characters, this is given), but the characters can be repeated. I saw web results on the same question, without the characters being repeated though.
Any two characters in the string can be swapped.
For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.
Thanks!
```
```This is an expanded and corrected version of Subhasis's answer.
Formally, the problem is, given a n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that n-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.
The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j1, ..., jk are pairwise distinct in {1, ..., m}, the cycle (j1 ... jk) is a permutation that maps ji to ji + 1 for i in {1, ..., k - 1}, maps jk to j1, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j1 ... jk) can be written as the product (j1 jk) (j1 jk - 1) ... (j1 j2) of k - 1 cycles. In general, every permutation can be written as a composition of m swaps minus the number of cycles comprising its cycle decomposition. A straightforward induction proof shows that this is optimal.
Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., n}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine how many parts a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j1 ... jk) of q, there is a part whose directed cycle is comprised of the arcs labeled j1, ..., jk.
The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.
As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(mn), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to try to solve this problem exactly on instances with large alphabets, then I likely would try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.
#article{DBLP:journals/corr/GutinJSW14,
author = {Gregory Gutin and
Mark Jones and
Bin Sheng and
Magnus Wahlstr{\"o}m},
title = {Parameterized Directed \\$k\\$-Chinese Postman Problem and \\$k\\$
Arc-Disjoint Cycles Problem on Euler Digraphs},
journal = {CoRR},
volume = {abs/1402.2137},
year = {2014},
ee = {http://arxiv.org/abs/1402.2137},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
```
```You can construct the "difference" strings S and S', i.e. a string which contains the characters at the differing positions of the two strings, e.g. for acbacb and abcabc it will be cbcb and bcbc. Let us say this contains n characters.
You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found out in O(n) time.
However, in the case where there are any number of duplicate characters, this reduces to the problem of finding out the largest number of cycles in a directed graph, which, I think, is NP-hard, (e.g. check out: http://www.math.ucsd.edu/~jverstra/dcig.pdf ).
In that paper a few greedy algorithms are pointed out, one of which is particularly simple:
At each step, find the minimum length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights )
Delete it
Repeat until all vertexes have not been covered.
However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!
Edit:
Please refer to David's answer for a fuller and correct explanation of the problem.
```
```Do an A* search (see http://en.wikipedia.org/wiki/A-star_search_algorithm for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.
```
```Hash Map data structure (that allows duplicates) is suitable for solving the problem.
Let the string be s1 and s2. The algorithm iterates through both the string and whenever a mismatch is found the algorithm maps the character of s1 to s2 i.e char of s1 as key and char of s2 as value is inserted in Hash Map wherever mismatch is occurred.
After this initialize the result as zero.
The next step is while the Hash Map is not empty do following:
For any key k find its value v.
Now use value v as the key to lookup in the Hash Map to find value if the found value is equal to k then increment the result by 1 and remove both the keys k and v from the Hash Map.
If the found value is not equal to k then only remove key k from the Hash Map and increment the result.