How to quickly recover a large PHP array?


8

In the example below, I have a PHP array with about 128,000 entries (Portuguese-language words) that I load from another file and use in some applications, comparing against its indexes as if it were a HashMap.

array(128521) {
  [0] =>
  string(3) "aba"
  [1] =>
  string(3) "aba"
  [2] =>
  string(4) "abá"
  [3] =>
  string(7) "ababás"
  [4] =>
  // ...
}

Question

How can I persist this array so that it can be loaded back quickly from disk without hurting the application's performance too much? The goal is to have the array ready for the classes that use it as soon as possible.

  • 5

    I was curious as to why this question was voted down. Could the person responsible explain?

  • 5

    I share @bfavaretto's curiosity; a downvote on a question should always come with a comment. Whoever voted it down presumably saw something wrong with the question, and in that case a comment would help with making the necessary improvements.

2 answers

6

In fact it’s completely unfeasible to do that.

128,000 entries in an array is a large amount to manipulate and/or compare.

The ideal is to keep this kind of information in a database and query it with a minimal set of parameters.
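
Just to illustrate that approach (the table palavras, its indexed column palavra and the connection details below are only assumptions), a lookup would hit an index instead of loading everything into memory:

<?php
// Hypothetical schema: a table "palavras" with an indexed column "palavra".
$pdo = new PDO('mysql:host=localhost;dbname=dicionario;charset=utf8', 'usuario', 'senha');

$stmt = $pdo->prepare('SELECT 1 FROM palavras WHERE palavra = :palavra LIMIT 1');
$stmt->execute([':palavra' => 'abá']);

$existe = (bool) $stmt->fetchColumn(); // true if the word is in the dictionary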

That said, I have perhaps two solutions:

Physical file

Keep this array as a physical file (whatever extension you prefer), in JSON format.
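
A minimal sketch of that idea, assuming the word list is already in a variable $palavras and using ptbr.json as the file name: generate the file once, then each application just reads and decodes it (and can flip it for hash-style lookups):

<?php
// One-off step: persist the array to disk (unescaped so the accents stay readable).
file_put_contents('ptbr.json', json_encode($palavras, JSON_UNESCAPED_UNICODE));

// In the applications: load and decode it again.
$palavras = json_decode(file_get_contents('ptbr.json'), true);

// Optional: flip values to keys so membership tests become hash lookups.
$hash = array_flip($palavras);
$existe = isset($hash['abá']);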

[EDITED] Idea

Just as a complement: in your case, I think it's best to create one file for each letter of the alphabet.

  • a.json
  • b.json
  • c.json

That way, when you compare or insert a value, for example "Ada", you only deal with the file for that letter; the files would hold entries such as:

  • Ada
  • adá
  • guava
  • giraffe (joke, haha)

With this I believe the lookup gets faster (each search reads a single, smaller file), but of course it can increase the number of requests on the server. To help with that, keep each file's content inline (minified, no whitespace) so its size is reduced as much as possible.
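
A rough sketch of that per-letter split, again assuming the full list is in $palavras (file and variable names are just examples):

<?php
// Group words by their first letter (mb_* so accented characters are handled).
$porLetra = [];
foreach ($palavras as $palavra) {
    $letra = mb_strtolower(mb_substr($palavra, 0, 1, 'UTF-8'), 'UTF-8');
    $porLetra[$letra][] = $palavra;
}

// Write one minified JSON file per letter: a.json, b.json, ...
foreach ($porLetra as $letra => $lista) {
    file_put_contents($letra . '.json', json_encode($lista, JSON_UNESCAPED_UNICODE));
}

// Lookup: only the file for the word's first letter is loaded.
$busca = 'abá';
$letra = mb_strtolower(mb_substr($busca, 0, 1, 'UTF-8'), 'UTF-8');
$lista = json_decode(file_get_contents($letra . '.json'), true);
$existe = in_array($busca, $lista, true);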


Cached page (json/xml)

I don't like or recommend XML, but feel free.

Have a page (route) in your system/site that returns this array, also in JSON format; that page is then cached for however long you decide.

That way, when other applications hit that specific page (for example: http://www.examplo.com.br/ptbr.json or http://www.examplo.com.br/ptbr.php - the extension itself doesn't matter, only how you handle the request), you return the page, which will now be served from cache instead of being rebuilt every time.
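
A minimal sketch of such a route, say ptbr.php, using only standard HTTP cache headers (the file name and the one-day max-age are assumptions; a framework cache would replace this):

<?php
// ptbr.php: serves the pre-generated JSON and tells clients/proxies to cache it.
$arquivo = 'ptbr.json';

header('Content-Type: application/json; charset=utf-8');
header('Cache-Control: public, max-age=86400'); // cache for one day
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime($arquivo)) . ' GMT');

readfile($arquivo);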


Question

Do you use any framework? The vast majority of them already ship with a cache system and good handling of the extension/response/protocol type for each request/route.


I hope I’ve helped.

  • Yeah, I have a cache system. I believe hitting the database is much slower than using a physical file, since I never search by iteration, only by hash. I thought about serialize, since that process is faster than JSON parsing.

  • @Calebeoliveira I've heard that unserialize is slower than json_decode; it's worth benchmarking both (there's a rough sketch of such a test after these comments).

  • @Calebeoliveira serialize is certainly an option. The ideal is to test the performance, which in this case I can't tell you for sure (I'd need to test it, just like you). Once you've found the best approach, just create a process (cron, maybe) to update the file every x hours, and/or only when that table changes in the database.

  • @bfavaretto, I ran the tests; unserialize is faster. @Patrick, it's a static file, it doesn't change.

  • @Calebeoliveira I edited the post with a possible idea for your question. Question: how big is the file (KB)?

  • @Patrickmaciel, the file is about 5 MB.

  • 2

    You could use a NoSQL solution for this: Redis, MongoDB, something whose performance suits you.

  • Okay, I’ll look into these tools too.
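
On the unserialize vs json_decode point discussed above, a rough benchmark like the one below is enough to decide (a sketch only, assuming ptbr.json and ptbr.ser were generated beforehand with json_encode and serialize):

<?php
// Compare how long it takes to rebuild the array from each format.
$inicio = microtime(true);
$a = json_decode(file_get_contents('ptbr.json'), true);
echo 'json_decode: ', round(microtime(true) - $inicio, 4), " s\n";

$inicio = microtime(true);
$b = unserialize(file_get_contents('ptbr.ser'));
echo 'unserialize: ', round(microtime(true) - $inicio, 4), " s\n";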

3


In addition to the NoSQL options and a Memcached layer mentioned above, a second option would be search engines. This type of problem has characteristics I would treat with indexing. Do you need the entire array, or just to query certain words according to some search criteria?

If it's the latter, I would use a tool like Apache Solr or Elasticsearch to create a dedicated index (e.g., an index holding [hash / word] tuples). Well-tuned indexes are very fast and already come with smart internal cache policies, able to answer frequent queries "instantly" even on an index with millions of entries.
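
Just to make that concrete, querying such an index from PHP could look roughly like this (a sketch only: the palavras index, the palavra field mapped for exact matching, and a local Elasticsearch on port 9200 are all assumptions):

<?php
// Term query against a hypothetical "palavras" index in Elasticsearch.
$consulta = json_encode([
    'query' => ['term' => ['palavra' => 'abá']],
]);

$ch = curl_init('http://localhost:9200/palavras/_search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_POSTFIELDS, $consulta);

$resposta = json_decode(curl_exec($ch), true);
curl_close($ch);

$existe = !empty($resposta['hits']['hits']); // any hit means the word is indexed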

  • I didn't understand what you mean by a dedicated index. Could you explain?

  • Hello, Caleb, this is a rather complex sub-area of information retrieval, but I'll try to explain it as best I can. Take Google's case: it indexes zillions of documents (HTML, PDFs, pages, images, etc.). When a Google crawler finds a document, the document is pre-processed and the relevant information about it is stored in an index, so that Google users can search and get "instant results".

  • In that sense, when you run a query, the search platform is not hitting the data directly but an index optimized to answer that type of search (following ranking criteria that "sort" the results by relevance). Apache Solr and Elasticsearch are search platforms for exactly this purpose. In short, you build/model your own indexes, feed them the necessary information, and these tools use the Apache Lucene library to build efficient and scalable indices.

  • These tools can take far more of a beating than a database and implement more specialized algorithms than a typical NoSQL tool (for example, engines for applying ranking policies, as in the comment above). Also, for gigantic indexes (not yours), these tools are ready for partitioning, clustering, etc. I understand that your problem is currently nothing that big, but it has the color, sound and smell of information retrieval. A dedicated word index should meet your needs.

  • Okay, I get it, I’m going to research these tools, maybe they’ll solve my problem.
