prateek sachan


undergrad,
indian institute of technology delhi, india


posted on: Friday August 02, 2013
These past few days I've been busy putting together a working product with the features I'd planned earlier to include in this version of Global Search.

Finally, I'm happy to have completed this milestone. Both my mentors, Tomasz and Aparup, have guided me well in this project, clearing my doubts every now and then. Tomasz has even installed the Global Search plugin on his website. You may try it out here: global-search.jmuras.com

Feel free to clone my Moodle gs2 branch and try out the product. I need developers to try it out and test it for any security leaks. The wiki has been updated with the complete procedure for setting up Global Search. Please contact me with your feedback and comments.

I'm including screenshots of some advanced search queries. The example here indexes two pages and one PDF file generated from the Superman Wikipedia page, and takes into account Moodle's Wiki module.

The first screenshot shows a normal search for superman returning 3 results.

The second screenshot shows a wildcard search for super*. Clearly, it matches different sets of keywords starting with super.

The third screenshot shows an example of a proximity search for "superman dc"~10. This means that results will be shown wherever the two words occur within 10 words of each other.

The remaining screenshots show boolean searches. You can clearly figure out the differences between them.
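For reference, here is a minimal sketch of issuing such queries through the PECL PHP Solr client; the connection options are assumptions for a local setup, not the plugin's actual configuration:

// Hypothetical connection options for a local Solr instance.
$client = new SolrClient(array('hostname' => 'localhost', 'port' => 8983, 'path' => '/solr'));

$query = new SolrQuery();
$query->setQuery('super*');               // wildcard search
// $query->setQuery('"superman dc"~10');  // proximity search
// $query->setQuery('superman AND dc');   // boolean search
$query->setStart(0);
$query->setRows(10);

$response = $client->query($query)->getResponse();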




My next task will be to design a search UI page. Suggestions are welcome. You may contact me directly or post on Moodle's developer forum thread.
Tags: gsoc
posted on: Monday July 15, 2013
This week I integrated Apache Tika into Moodle to support indexing of rich documents like PDF, DOC, PPT, etc. Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr, have Solr extract text from them and index it, making the files searchable.

One has to send the file to Solr via HTTP POST. The following cURL request does the work:

curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@ps.pdf"
• ps.pdf: the file sent to Solr for content extraction.
• literal.id=1: assigns id=1 to the Solr document thus created.
• commit=true: commits the changes to the Solr index.
• myfile=@ps.pdf: this needs to be a valid relative or absolute path.

Refer to the wiki for more options on ExtractingRequestHandler. Now, using the PECL PHP Solr client in Moodle, there isn't a way to get the extracted content back and add it to an existing Solr document's field. The cURL request creates an all-new Solr document specifically for the file and adds the content to that document's fields.

Also, Moodle's get_content_file_location() function, which returns the absolute file path of stored files, is protected. But there is a predefined add_to_curl_request() function that adds a file to the cURL request:
$curlrequest->_tmp_file_post_params[$key] = '@' . $this->get_content_file_location();

So, keeping all these things in mind, I came up with the following logic for indexing rich documents via ExtractingRequestHandler in Global Search.

The access rights will be checked by extracting the $id of the Solr document and passing it to the forum's access check function. [Full code]

And here's the code that I've written for the Forum module.

The above code sends the external files to Solr, which extracts their content and creates new Solr documents. I'm not committing the documents after each cURL request, as that would take a lot of time. Instead, after all the documents have been added, I execute $client->commit() once at the end.
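A simplified sketch of that flow, assuming a $files array mapping Solr document ids to file paths and a connected SolrClient $client (the names are illustrative, not the actual Forum module code):

$url = 'http://localhost:8983/solr/update/extract';
foreach ($files as $id => $filepath) {
    // Note: no commit=true per request; committing after every file would be slow.
    $ch = curl_init($url . '?literal.id=' . $id);
    curl_setopt($ch, CURLOPT_POST, true);
    // Older PHP attached the file with the '@' prefix shown earlier; CURLFile is the current equivalent.
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => new CURLFile($filepath)));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}
// Commit once after all the documents have been sent.
$client->commit();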
posted on: Sunday June 30, 2013
This week I started writing the Search API functions for Global Search. The idea is to code three functions for each module. These will be written in the module's lib.php file:
_get_iterator($from=0)
_search_get_documents($id)
_search_access($id)

The first two functions are used while indexing records, while the last one checks user permissions before displaying search results.

The admin has the option to enable Global Search support for a particular module/resource through the settings. You may view the code here.

The first function, _get_iterator($from=0), returns a recordset. I've already covered it in Updating Solr Index in Global Search.
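As a rough illustration (the table and module names here are hypothetical, not the final API), such an iterator could look like this:

function page_get_iterator($from = 0) {
    global $DB;
    // Return only the records modified since the previous index run.
    return $DB->get_recordset_select('page', 'timemodified >= ?', array($from));
}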

The second function, _search_get_documents($id), creates a SolrInputDocument from database data by specifying fields. An example is shown below:
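A sketch with illustrative field names (the actual Global Search schema may differ):

function page_search_get_documents($id) {
    global $DB;
    $record = $DB->get_record('page', array('id' => $id), '*', MUST_EXIST);

    $doc = new SolrInputDocument();
    $doc->addField('id', $record->id);
    $doc->addField('module', 'page');
    $doc->addField('title', $record->name);
    $doc->addField('content', strip_tags($record->content));
    $doc->addField('courseid', $record->course);
    $doc->addField('modified', $record->timemodified);
    return $doc;
}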

The tricky part is to structure our indexed records correctly. For example, for the book module, _get_iterator() will return the record of a particular chapter. Hence, each chapter will be a separate SolrInputDocument whose Solr id field is the chapter id.

The third function maintains security by checking Moodle capabilities and restricting access to prohibited search results. I've already discussed Global Search security in Handling security in Global Search.
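As a sketch, an access check for a hypothetical page module could look like the following (the capability name and lookup logic are illustrative):

function page_search_access($id) {
    global $DB;
    $record = $DB->get_record('page', array('id' => $id));
    if (!$record) {
        return SEARCH_ACCESS_DELETED;   // record no longer exists
    }
    $cm = get_coursemodule_from_instance('page', $record->id, $record->course);
    $context = context_module::instance($cm->id);
    if (!has_capability('mod/page:view', $context)) {
        return SEARCH_ACCESS_DENIED;    // user may not view this record
    }
    return SEARCH_ACCESS_GRANTED;
}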
posted on: Thursday June 20, 2013
I recently implemented the functionality that allows the admin to delete the Solr index. The code can be seen here.

Solr provides a simple way of deleting an index using SolrClient::deleteByQuery. I have provided two ways of deleting the index:
• Deleting the complete index in one go.
• Deleting the index modularly (for example, deleting the index for records belonging to the book and page modules only).

The idea was to let the admin either select the delete-all option or choose individual modules. I made these options available to the admin through Moodle Quick Forms.
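A rough sketch of the form elements (the element names and string identifiers here are illustrative, not the plugin's actual code):

$mform->addElement('checkbox', 'all', get_string('deleteall', 'search'));
$select = $mform->addElement('select', 'module', get_string('modules', 'search'), $modulenames);
$select->setMultiple(true);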
If the admin checks the all checkbox, $client->deleteByQuery('*:*') is executed, deleting the entire Solr index.

If, on the other hand, the admin chooses only some modules, their names are concatenated into a string, stored as $data->module on a stdClass, and passed as a parameter into the search_delete_index function, which executes $client->deleteByQuery('modules:' . $data->module)

That's it for the first part: deletion.
After deletion, I need to handle the config settings so that the admin is able to re-index. This is done by resetting the values in the config_plugins table, through the simple code below:
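As a sketch (the config names are illustrative; the actual plugin settings may be named differently):

foreach ($mods as $mod) {
    // Reset the stored indexing checkpoints so the module is re-indexed from scratch.
    set_config($mod . '_indexingstart', 0, 'search');
    set_config($mod . '_indexingend', 0, 'search');
}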


Here, $mods will be a simple array containing the names of all modules or only those modules whose index was selected for deletion.
posted on: Wednesday June 19, 2013
Last week, I started coding the admin page for Global Search. Here are the three indexing configurations that I've planned to implement:
• Adding new documents (written such that indexing resumes from a previous run).
• Deleting the index.
• Updating the index for updated records.

For updating the index when a record is updated or changed, Solr gives us two options:
• Treat the "updated" record as a whole new SolrDocument and re-index the complete document.
• Perform a partial update by re-indexing only that field which was updated.

The first approach outlined above is pretty simple. The iterator will return a recordset of records whose timemodified is newer than the previous index run, and those records will be re-indexed accordingly. [As implemented earlier by my mentor Tomasz. See the wiki.]


The second approach, atomic updates, was released by Solr only recently. It could be very useful where thousands of documents have been updated at once and the first approach would consume a lot of time.

Let's take an example. Suppose we have 1000 books in Moodle stored under courseid: 1. The teacher/admin imports all the books into another course, say courseid: 2. Re-indexing all 1000 books would be wasteful here; all we need to do is update the courseid field of all the books.

Solr supports several modifiers that atomically update values of a document.

• set – sets or replaces a particular value, or removes the value if null is specified as the new value.
• add – adds an additional value to a list.
• inc – increments a numeric value by a specific amount.

However, there's no specific PHP way of doing this; only XML and JSON are supported.
Hence, I will have to use the SolrClient::request function to send a raw XML update request to the Solr server. Here is a sample of doing it in PHP.
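Roughly, the raw XML update string could be built like this (the $bookids loop and values are illustrative, following the 1000-books example above):

$s = '<add>';
foreach ($bookids as $id) {
    $s .= '<doc>';
    $s .= '<field name="id">' . $id . '</field>';
    // update="set" atomically replaces only the courseid field of each document.
    $s .= '<field name="courseid" update="set">2</field>';
    $s .= '</doc>';
}
$s .= '</add>';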



This string is then sent with the following commands:

$client->request($s);
$client->commit();
$client->optimize();


One thing to keep in mind is that the string above should stay below the upload limit defined in solrconfig.xml:
multipartUploadLimitInKB="2048000". Running the above code resulted in a string of ~80KB, so we could easily use it for updating fields in a large set of documents.

However, I still have to discuss with my mentors how to implement this second approach in Global Search, which I will probably do this week.
posted on: Sunday June 16, 2013
Handling security issues will be an integral part of Global Search. The last thing we want is users gaining access to prohibited records through search. It would be a huge blow to the project if users could view documents that they are not permitted to see. The solution is to filter the results after receiving the query response from the Solr server:
$client->query($query)->getResponse();


Here, I will be using 3 cases for every search result:
SEARCH_ACCESS_GRANTED
SEARCH_ACCESS_DENIED
SEARCH_ACCESS_DELETED

I will check for every result whether the user has access to view it. If the user doesn't have access (SEARCH_ACCESS_DENIED), that particular result will not be shown to the user.

In the alternative case, if the user is found to have permission to view a particular result (SEARCH_ACCESS_GRANTED), that record is further checked for whether it has been deleted:
• If it has been deleted (SEARCH_ACCESS_DELETED), the index is updated by removing that document using deleteByQuery('id:' . $docid).
• If the record still exists, the result is displayed to the user.

We will fetch only 1000 results from the Solr response object for a query ($query->setRows(1000)) and check them for access. Once we have 100 results to show to the user (those with SEARCH_ACCESS_GRANTED), the loop stops checking permissions and displays those 100 results.
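A minimal sketch of that filtering loop, assuming a connected $client and a search_check_access() helper that dispatches to the module's _search_access() function (both names illustrative):

$query->setRows(1000);
$docs = $client->query($query)->getResponse()->response->docs;

$shown = array();
foreach ($docs as $doc) {
    $access = search_check_access($doc->module, $doc->id);
    if ($access == SEARCH_ACCESS_DELETED) {
        // Purge the stale document from the index.
        $client->deleteByQuery('id:' . $doc->id);
        continue;
    }
    if ($access == SEARCH_ACCESS_GRANTED) {
        $shown[] = $doc;
    }
    if (count($shown) == 100) {
        break;  // stop once 100 displayable results are collected
    }
}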
posted on: Sunday June 09, 2013
Well, this particular post is dedicated to data structures. It will be the first time I'd be implementing a trie in a real-life situation (apart from college assignments), hence I thought a post was in order.

When integrating the Apache Solr search engine, one of the most important files is schema.xml. A lot of the search's effectiveness and optimization depends on it. The file contains details about the different fields of our documents that will be indexed, and the manner in which the indexing will take place.

So, let's talk about tries for a while. Suppose we have five words:
tall
tea
tee
tim
tom

The above five words could be stored in a trie in the following manner:
t--a--l--l
|
|--e--a
|  |
|  \--e
|
|--i--m
|
\--o--m

Now, Solr uses this structure to index documents. Following is the declaration of Trie field types in schema.xml.
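For reference, the stock Solr example schema declares trie-based field types along these lines (a sketch from Solr's bundled schema, not necessarily the project's final schema):

<fieldType name="tint"  class="solr.TrieIntField"  precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>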

Suppose I want to index the integer 4208909997. When Solr indexes this integer, it saves the hex equivalents at different precision levels (FADEDEAD = 4208909997 in hex).

A precisionStep="12" would result in:
FA000000
FADED000
FADEDEAD

A precisionStep="8" would result in:
FA000000
FADE0000
FADEDE00
FADEDEAD

A precisionStep="4" would result in:
F0000000
FA000000
FAD00000
FADE0000
FADED000
FADEDE00
FADEDEA0
FADEDEAD

Now, if Solr has to match the values FADEDEA0 through FADEDEAF, a precision step of 8 means going through all 16 possibilities to find the records, but a precision step of 4 needs just the single prefix term FADEDEA0.

So clearly, a small precision step makes queries fast but also increases the index size. Hence, I will have to test different cases to come up with the "perfect" schema for Solr.
posted on: Saturday June 08, 2013
This week I got started integrating search into Moodle core, writing code for search within Moodle's page module. I decided to quickly pick a module and test Solr with it. Things become easy when you actually do stuff and see it happening in front of you. This also gave me the advantage of laying down the search API structure. Thanks to my mentor Aparup for guiding me with it.

The idea is to use the php-pecl-solr extension. It's faster as it's built into PHP itself as a native extension. Some major advantages:
• It's wholly written in C, and hence extremely fast compared to other PHP Solr client libraries.
• Its object-oriented API makes it easy and efficient for developers to communicate and interact with the Apache Solr server.
• It's documented in the PHP manual.

However, one has to have a dedicated server to install this extension.
There were many doubts concerning the integration of such an extension into Moodle, the implication being that the extension depends on the server: Moodle sites installed on shared servers cannot use this search feature, and the server needs to support Java hosting. However, my mentor Tomasz has previously tried implementing a search feature in Moodle, and he notes that there are no search engines written in PHP (the language Moodle is based on). I feel that since Moodle is largely used by colleges and universities across the world that have their own dedicated servers, using the extension is a reasonably safe bet. Something is always better than nothing.

In the future, we will obviously make it flexible to avoid dependency on the server, once we've seen the success of this first version of Global Search.

The current official version of the php-pecl-solr extension doesn't work with Solr 4. There's a minute difference in the client constructor that determines whether it can be used with Solr 3.x or Solr 4.x.
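For context, here is the basic construction of the PECL client (the options assume a local Solr install; the 3.x vs 4.x constructor difference mentioned above is not shown here):

$options = array(
    'hostname' => 'localhost',
    'port'     => 8983,
    'path'     => '/solr',
);
$client = new SolrClient($options);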


However, a patch is available that will be merged into the official stable release in the future.

I'm maintaining a complete procedure for installing this extension in the Moodle docs.

Feel free to go through, or drop in on, the discussion here.
posted on: Tuesday May 28, 2013
The GSoC results were announced yesterday. After weeks of anticipation, anxiety and horror (yes!), it was exhilarating to see my proposal for Global Search accepted by Moodle. Two months of hard work, sleepless nights, bunked lectures and screwed-up exams didn't go to waste after all. No matter how much you try or how hard you work, one is always pessimistic about the things that aren't under one's own control.

I love Open Source; it's become sort of an addiction lately. The major reason is the awesome developers. Hats off to them. They are smart, witty and opinionated. Like a piece of optimized code, their replies are exact, precise and just to the point. Nothing less, nothing more. They'll outsmart you every time with their double-edged wit. Also, they can talk indefinitely about anything and everything: from earthquakes to cats, and from North Korea to nuclear power.

I first used and came to know about Moodle in college. Moodle is currently installed on my college intranet server for course management. I want to thank the Moodle selection team for selecting my proposal for the project, and the whole Moodle community for giving such great feedback and help whenever I needed it. Also, I would like to deeply thank Aparup Banerjee and Tomasz Muras for helping me out and tolerating my random queries from time to time.

For the next three to four months, I will be developing a full-site search for Moodle. The idea is to adopt the widely used open-source Solr search platform from the Apache Lucene project and integrate it into Moodle.
Tags: gsoc