Check-in [28876e8682]
Not logged in

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Search updated
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 28876e8682702dc4fb2e2b068845fc46391f4a84
User & Date: bernd 2020-01-28 19:19:15.872
Context
2020-01-30
12:35
Bump version number check-in: a6fdb1eb15 user: bernd tags: trunk, 0.9.7-20200130
2020-01-28
19:19
Search updated check-in: 28876e8682 user: bernd tags: trunk
16:50
Avalanche tree added to distributed data check-in: 0d21a6c216 user: bernd tags: trunk
Changes
Unified Diff Ignore Whitespace Patch
Added wiki/search.md.






























































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# Searching with net2o #

Privacy enhanced networks (“darknets”) are notoriously difficult to search
with traditional approaches.  For a start, crawlers will not work, because
most material is not actually public.  Furthermore, you actually don’t want to
send your queries to a centralized search engine, anyhow — there is a reason,
why net2o is a peer to peer network.  This means you want a non–traditional
search engine, anyhow.

So how is search in net2o going to be implemented?

## Search locally ##

Indexing local data is easy.  Your local data is available to you, and you can
create a keyword,document key–value–list without problems.  Too frequent terms
are not stored (stopwords), and multiple search terms are a self join of the
database query.  Local search is the starting point of a peer 2 peer search;
it relies on humans crawling the net instead of robots.  When everybody is
participating in global search, this local search as a start is good.  It has
a number of advantages:

* The “crawler” sees the same thing as a normal user — SEOs can't provide
  misleading, different content to crawlers
* People do rank stuff they see, by clicking on likes and resharing.
  Providing dislikes allows to rank down, too.
* People can add tags when they reshare, making a search more specific
* Hashtags can improve searching by highlighting which terms are supposed to
  be important and relevant
* Dictionary based approaches can identify relevant search terms
  automatically: If a word occurs with a significantly much higher frequency
  than usual, it is likely a relevant term, if it occurs in a title or
  subtitle, it is likely more relevant.  Dictionary based analysis should map
  variants (inflections, spelling mistakes) to the same stem
  (e.g. “searching”, “search”, and “seach”) are the same search term).

Local search is the only search that is possible for closed communication.

## Global decentralized search ##

Only publically available (unrestricted) data is supposed to be globally
searchable.  A peer to peer network however does not guarantee that first
privately posted stuff remains private — this depends on the cooperation of
your peers, or if they decide to leak your message.  For material that is
intented to be publically visible, there needs to be a p2p method to spread
search results.  The indexing is done by local search index, usually by the
author himself (only in the leak case not).  The indexing extracts relevant
search terms (by looking at hash tags and dictionary based automatic search
terms).  Once indexed, the document anchor is submitted to the search term
servers responsible for the relevant terms, together with a short list of who
has this document.

Search engines like Google already have algorithms that distribute the load
within a large data center.  The concept can be used for a decentralized
search:  First stage is an index of responsibility.  So if you search, you ask
the index who carries the list of search results for that term.  To reduce
load of individual carriers of search result lists, frequently used terms can
be combined with secondary terms, e.g. “forth” and “program[ming]” can be
combined to be responsible for searching for that programming language, while
“back and forth” can be combined for that common phrase.

After getting responsible nodes from the index, you ask one of them with
secondary search terms to deliver the self join of that list.  By doing so,
you can now announce responsibility for that particular combination of search
results, and offer that if other people ask for it.  This allows to scale
frequent searches, because the more often people search for a term, the more
copies of a search result list are around.

There’s still a privacy problem that you should not attach search queries to
your permanent ID.  However, there is a severe problem with complete anonymity
for asking (e.g. using hash(search term) as secret key), because that allows
spammers to capture a search term.  It is necessary to tie reputations to
answer search queries.  The good thing is that the original search query list
can be signed by the responsible node, and cache nodes only hand out that
signature, so cache nodes can use throw-away one-time IDs.

Search lists effectively are chat logs with some special properties.  They
contain a reference to the document, a short excerpt that contains the most
relevant search terms, and a list of tags: search terms in decreasing order of
relevance.