Base62 encoding

6

I'm looking for a PHP implementation similar to PHP's MIME Base64 functions, but one that uses only the characters A-Z, a-z and 0-9.

PHP's Base64 is quite versatile, but I need an algorithm whose output never contains the characters +, /, - or =. I know that I can replace the characters mentioned for the purpose of a URL, but that is not my use case; what I want is an algorithm that encodes directly into the 62-character alphabet.

Encoding numbers in base 62 is straightforward, but what I need is to encode a binary PHP string. It is important to support both encoding and decoding.

Can someone point me to a practical implementation?

  • https://github.com/vinkla/base62

  • @gmsantos I appreciate the reference, but I already knew about it. As I explained, Base62 encoding of numerical values is basic and straightforward. What I want is an encoder with the same range (A-Z, a-z and 0-9), but for strings. I can't find one, so I think I really will have to develop one myself.

  • @chambelix Did you see my answer?

  • @Victor I saw it and I'll reply; I'm analyzing it :)

2 answers

4


Here is the implementation.

  • Implemented in two different languages: PHP and Java.
  • Allows you to specify the alphabet in the constructor.
  • The alphabet size is obtained from the alphabet itself.
  • Should work for any alphabet size >= 2 and < 256.
  • encode interprets the input string as a base-256 number, one digit per byte, and builds a BigInteger from it. The accumulator starts at 1 rather than 0, so leading zero bytes are not lost. The BigInteger is then written out as a base-62 string (or any other base, according to the given alphabet).
  • decode is just the reverse of encode: it reads the string as a number in base-62 (or any other base), converts it to a BigInteger, writes the BigInteger out as a base-256 string, and drops the leading marker byte.
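One detail that is easier to see in isolation is why both implementations seed the accumulator with 1 instead of 0. The following is only a minimal Java sketch of that idea, not the answer's code: the class name SeedSketch and the helpers toNumber and toDigits are hypothetical names made up for this illustration. With a seed of 0, the byte sequences {65} and {0, 65} become the same number and therefore the same text; with a seed of 1 they stay distinct.

```java
import java.math.BigInteger;

final class SeedSketch {
    // Same 62-symbol alphabet as the usage examples in this answer.
    static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Reads the bytes as base-256 digits, starting from the given seed (hypothetical helper).
    static BigInteger toNumber(byte[] bytes, BigInteger seed) {
        BigInteger big = seed;
        for (byte b : bytes) {
            // b & 0xFF maps the signed Java byte to its unsigned value 0..255.
            big = big.shiftLeft(8).add(BigInteger.valueOf(b & 0xFF));
        }
        return big;
    }

    // Writes a non-negative number using the symbols of ALPHABET (hypothetical helper).
    static String toDigits(BigInteger big) {
        BigInteger radix = BigInteger.valueOf(ALPHABET.length());
        StringBuilder sb = new StringBuilder();
        while (big.signum() > 0) {
            BigInteger[] qr = big.divideAndRemainder(radix);
            sb.append(ALPHABET.charAt(qr[1].intValue()));
            big = qr[0];
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        byte[] one = {65};
        byte[] two = {0, 65};
        // Seed 0: the leading zero byte vanishes and both inputs collide.
        System.out.println(toDigits(toNumber(one, BigInteger.ZERO))); // 13
        System.out.println(toDigits(toNumber(two, BigInteger.ZERO))); // 13
        // Seed 1: the marker keeps the two inputs apart.
        System.out.println(toDigits(toNumber(one, BigInteger.ONE)));  // 5B
        System.out.println(toDigits(toNumber(two, BigInteger.ONE)));  // H45
    }
}
```

The price of the marker is that the decoder has to discard one extra leading byte, which is what the substr($result, 1) call in the PHP version and the i > 0 check in the Java version do.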

PHP:

Here’s the code:

<?php

include('Math/BigInteger.php');

class BaseN {

    private $base;
    private $radix;
    private $bi256, $one, $zero;

    function __construct($base) {
        $this->base = $base;
        $this->radix = new Math_BigInteger(strlen($base));
        $this->bi256 = new Math_BigInteger(256);
        $this->zero = new Math_BigInteger(0);
        $this->one = new Math_BigInteger(1);
    }

    public function encode($text) {
        $big = $this->one;
        for ($j = 0; $j < strlen($text); $j++) {
            $big = $big->multiply($this->bi256)->add(new Math_BigInteger(ord($text[$j])));
        }
        $result = "";
        while (!$this->zero->equals($big)) {
            $parts = $big->divide($this->radix);
            $small = intval($parts[1]->toString());
            $big = $parts[0];
            $result = $this->base[$small] . $result;
        }
        return $result;
    }

    public function decode($text) {
        $big = $this->zero;
        for ($j = 0; $j < strlen($text); $j++) {
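            $i = strpos($this->base, $text[$j]);
            if ($i === false) {
                // Reject symbols outside the alphabet instead of silently treating them as 0.
                throw new InvalidArgumentException("Character not in alphabet: " . $text[$j]);
            }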
            $big = $big->multiply($this->radix)->add(new Math_BigInteger($i));
        }
        $result = "";
        while (!$this->zero->equals($big)) {
            $parts = $big->divide($this->bi256);
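            // toBytes() returns an empty string for a zero remainder, which would drop NUL bytes,
            // so build the byte from the numeric value instead.
            $small = chr(intval($parts[1]->toString()));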
            $big = $parts[0];
            $result = $small . $result;
        }
        return substr($result, 1);
    }
}

?>

Usage:

// Pass the alphabet as a parameter. It has 62 characters here, so there are 62 symbols.
$k = new BaseN("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
$x = "The quick brown fox jumps over a lazy dog";
echo $x . "\n";
$c = $k->encode($x);
echo $c . "\n"; // Prints "1u9WLfG65OMtVkQWPtWDcC6o8IjI5td5l9DzpilIK4Nyx81tKLRrStPj"
$d = $k->decode($c);
echo $d . "\n"; // Prints "The quick brown fox jumps over a lazy dog"

See it running on ideone (don't be alarmed by the size of the code; I had to paste the entire BigInteger class there).

In Java

And if anyone is interested, I also implemented it in Java. Here’s the code:

import java.util.List;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;

/**
 * @author Victor
 */
public class BaseN {
    private static final BigInteger BI_256 = BigInteger.valueOf(256);

    private final String base;
    private final BigInteger radix;

    public BaseN(String base) {
        this.base = base;
        this.radix = BigInteger.valueOf(base.length());
    }

    public String encode(String text) {
        byte[] bytes = text.getBytes();
        BigInteger big = BigInteger.ONE;
        for (byte b : bytes) {
            // Java bytes are signed; mask to 0..255 so bytes >= 0x80 are not added as negative values.
            big = big.multiply(BI_256).add(BigInteger.valueOf(b & 0xFF));
        }
        StringBuilder sb = new StringBuilder(bytes.length * 4 / 3 + 2);
        while (!BigInteger.ZERO.equals(big)) {
            BigInteger[] parts = big.divideAndRemainder(radix);
            int small = parts[1].intValue();
            big = parts[0];
            sb.append(base.charAt(small));
        }

        return sb.reverse().toString();
    }

    public String decode(String text) {
        BigInteger big = BigInteger.ZERO;
        for (char c : text.toCharArray()) {
            int i = base.indexOf(c);
            if (i == -1) throw new IllegalArgumentException();
            big = big.multiply(radix).add(BigInteger.valueOf(i));
        }

        List<Byte> byteList = new ArrayList<>(text.length());
        while (!BigInteger.ZERO.equals(big)) {
            BigInteger[] parts = big.divideAndRemainder(BI_256);
            int small = parts[1].intValue();
            big = parts[0];
            byteList.add((byte) small);
        }
        Collections.reverse(byteList);

        byte[] r = new byte[byteList.size() - 1];
        int i = 0;
        for (Byte b : byteList) {
            if (i > 0) r[i - 1] = b;
            i++;
        }
        return new String(r);
    }
}

Usage:

public class Main {
    public static void main(String[] args) {
        // Pass the alphabet as a parameter. It has 62 characters here, so there are 62 symbols.
        BaseN bn = new BaseN("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
        String x = "The quick brown fox jumps over a lazy dog";
        System.out.println(x);
        String a = bn.encode(x);
        System.out.println(a); // Prints "1u9WLfG65OMtVkQWPtWDcC6o8IjI5td5l9DzpilIK4Nyx81tKLRrStPj"
        String b = bn.decode(a);
        System.out.println(b); // Prints "The quick brown fox jumps over a lazy dog"
    }
}

See it running on ideone.

  • 1

    Congratulations: you have just given me an algorithm that will save me a few hours... and you use Math_BigInteger, as I do. I can only vote once, and I already have; otherwise I would vote for your answer again. Let me say that your solution is complete and the code is well structured. May I use it in my code? And the Java solution is the icing on the cake.

  • @chambelix Of course you can use it! :) I'm glad I could help, because this was very challenging and interesting to do.

  • There is an error for certain string sizes on the encode line $small = ord($parts[1]->toBytes()[0]). But I've already caught it... thanks again.

  • @chambelix Please explain the error in more detail. I hate bugs. Which string did you use?

  • @chambelix I think I have a solution for this bug, but first I need a test case that reproduces it, to confirm that my fix works. The problem is that I haven't found any case that reproduces the bug yet, so if you can give me one, I will gladly fix it. :)

  • Try the string "excellent solution": it will work, but if you add one more space between the words or add one more character... it will fail. I'm finishing some code right now, so I'll only analyze your algorithm and fix the error later. If you'd rather not bother, I'll post the fix myself. Mistakes are normal.

  • @chambelix: https://ideone.com/Mu6ZGh - for excelente solução, encode generated p0KO2123BeMgaOUElggUpPOWYFb, which decode turned back into excelente solução; that is, it worked. However, the ç and the ã make me want to look for encoding problems.

  • @chambelix PHP Notice: Uninitialized string offset: 0 in /home/miR2Be/prog.php on line 1990 - Okay, I’ll get right on that...

  • For me, "excellent solution1" gives the error "Notice: Uninitialized string offset: 0"... However, I tried a direct approach to the problem and put in: if (isset($parts[1]->toBytes()[0])) { $small = ord($parts[1]->toBytes()[0]); } else { $small = 0; } ---> and that resolved it!

  • @chambelix Fixed. :)

  • 1

    I saw it :) Once again, my congratulations... fixed!

1

You cannot have something represented in Base64 with just this character range (A-Z, a-z and 0-9), as this range has only 62 characters and Base64 requires 64 distinct symbols.

So if you don't want the characters + and / in your Base64 representation, you will need to replace them with something else outside this range. You will also have to choose a replacement for =, because it can appear in a Base64 representation as padding to complete the last block.

What is done in practice, for example when a Base64 representation must be included in a URL, is to replace the set {+ / =} with {- _ ,}, respectively.

From a quick search, it seems PHP has no native function for this, so you'll have to implement your own.

Even if you don't intend to use it in a URL, this idea should suit you:

function base64url_encode($plainText) {
    // Standard Base64 first, then swap the three unwanted symbols.
    $base64 = base64_encode($plainText);
    $base64url = strtr($base64, '+/=', '-_,');
    return $base64url;
}

function base64url_decode($base64url) {
    // Undo the substitution, then decode as standard Base64.
    $base64 = strtr($base64url, '-_,', '+/=');
    $plainText = base64_decode($base64);
    return $plainText;
}

Update: It also just occurred to me that you can convert your bytes to hexadecimal, which uses only 0-9 and A-F. The resulting string is considerably longer than the Base64 representation (two characters per byte), but it may suit you. I don't know of a PHP function that does this, but the logic of converting bytes to hexadecimal is quite simple.
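To make the hexadecimal idea concrete, here is a minimal sketch in Java, the second language already used on this page. The class name HexSketch and the methods toHex and fromHex are hypothetical names chosen for this illustration. Each byte is split into its high and low 4-bit halves, and each half is mapped to one of the 16 symbols. For PHP itself, note that the built-in functions bin2hex() and hex2bin() already cover this conversion.

```java
final class HexSketch {
    private static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

    // Encodes every byte as two hexadecimal characters (hypothetical helper).
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(DIGITS[(b >> 4) & 0x0F]); // high 4 bits
            sb.append(DIGITS[b & 0x0F]);        // low 4 bits
        }
        return sb.toString();
    }

    // Reverses toHex; rejects odd lengths and non-hex characters (hypothetical helper).
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Not a hex digit at pair " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0, (byte) 0xFF, 0x7F};
        String hex = toHex(data);
        System.out.println(hex); // 00FF7F
        System.out.println(java.util.Arrays.equals(data, fromHex(hex))); // true
    }
}
```

Unlike the BigInteger approach, this works byte by byte, so zero bytes need no special handling and the cost is linear in the input size; the trade-off is the doubled length.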

  • I don't think that's what he wants. From what I understand, he wants a base-62 encoding algorithm, not a way to represent base-64 with 62 distinct symbols. And even so, there are ways to represent base-64 with 62 symbols. For example, you use 61 symbols normally (0-9, A-Z and a-y), and then use za, zb and zc for the remaining three symbols, where z can never appear at the end nor be followed by anything other than a, b or c.

  • @Victor There are 1001 ways to represent any value anyway, and you can always invent one more, just like Neston. All it takes is a little trick like the one I put in my answer :-)

  • Yes, exactly. It's just that in your answer you say "You cannot have something represented in Base64 with just this character range", and I just showed you a way to do that. :)

  • @Victor, we all know that there are more things in heaven and earth than are dreamt of in our philosophy; so we write "yes", "no", "always", "never" without fear, because we know that to some degree, at some time, or from some point of view, something will differ from what was stated. The wise read us without taking these words to the extreme, because they know the same and know that we knew it when writing. So don't worry: just write an answer with your ideas and it will have the potential to help someone. If it does not help, the site's mechanisms (votes, moderation...) make sure it does no harm either.

  • @Caffé Thank you very much for your time, but as I mentioned in my question, "I know I can replace the characters mentioned for the purpose of a URL", and I know the Base64 algorithm and its applicability very well. Your answer is a bit beside my question... but I appreciate the time you spent on it.
