JC's ABC search bot


This directory contains the search bot for JC's Tune Finder. This is a program that looks for web sites with files in the ABC music notation. When it finds one, it extracts assorted interesting musical information from the file, and adds it to the Tune Finder's database. Meanwhile, people are connecting and asking about tunes, downloading them, or requesting them in any of the several output formats that the Finder supplies.
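As a rough illustration of the kind of information that can be pulled out of an ABC file, here is a minimal Python sketch. It is not the bot's own code (abcbot is a Perl program), and which fields abcbot actually records is an assumption. It relies only on the standard ABC header fields X: (tune index), T: (title), M: (meter) and K: (key); the function name abc_headers and the sample tune are made up for this example.

```python
import re

# Matches an ABC information field such as "T:Some Title".
FIELD_RE = re.compile(r"^([A-Za-z]):\s*(.*?)\s*$")

def abc_headers(text):
    """Return one dict of header info per tune found in a string of ABC text."""
    tunes = []
    current = None
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if not m:
            continue                      # music line, blank line, or comment
        field, value = m.group(1), m.group(2)
        if field == "X":                  # X: starts a new tune
            current = {"index": value, "titles": []}
            tunes.append(current)
        elif current is None:
            continue                      # field outside any tune; ignored here
        elif field == "T":                # a tune may carry several titles
            current["titles"].append(value)
        elif field == "M":
            current.setdefault("meter", value)   # keep the first meter seen
        elif field == "K":
            current.setdefault("key", value)     # keep the first key seen
    return tunes

# Hypothetical sample tune, invented for this sketch.
SAMPLE = """X:1
T:Example Reel
M:4/4
L:1/8
K:D
|:d2fd adfd|e2ge bege:|
"""

if __name__ == "__main__":
    for tune in abc_headers(SAMPLE):
        print(tune)
```

A real scanner would also have to cope with comments, mid-tune key changes, and ABC embedded in HTML, all of which this sketch ignores.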

Some of the interesting things here:

abcbot
is the search bot itself. It is a Perl program that reads the URLs file for its list of starting points and places to avoid. abcbot also expects a list of hosts on the command line, and it scans only those sites. It keeps the data for each host in a file in the hst/ directory.
ABCsearch
queries several big search sites for "ABC music notation". This is the best set of keywords that we've found for locating ABC sites. About 20% of our sites were found this way. For each site found, it writes the details into a file in the add directory.
Hosts2html
rebuilds the index files in ndx/. These are the files used by the Tune Finder. The index files are rebuilt after a search run. They may also be rebuilt at any time if there are problems, or if we've done a special scan of one or more sites.
Makefile
is a conventional Unix makefile. We use it to drive the search process, which is started by hand once a month or so.
NewURL
takes a URL and does a scan for ABC files with the URL as a starting point. If the scan is successful, the URL should usually be added to URLs.
scandata
is a file showing statistics from the most recent scans, one line per host. This usually includes "false positive" hosts from the big search sites, so we can see how successful ABCsearch was.
UpdateHosts
is a script that drives the search run. We normally call abcbot for a single host. UpdateHosts has the task of keeping track of which hosts have been scanned, and starting up abcbot for each host. It uses both the hst/ and add directories to decide which hosts to scan.
findrobotstxt
is a little program to list the URLs of all robots.txt files in our ABC hosts. It may be slow, because it queries every host, and sometimes they don't respond. So run it in the background.
hst/
contains one file per host, listing all the interesting files and ABC tunes that abcbot found on that host.
lck/
contains lockfiles for hosts, so that we don't get two programs trying to scan the same host.
ndx/
is our copy of the index files used by the Tune Finder. The Tune Finder actually works out of ../ndx/, so the files are linked there after we've rebuilt and verified them here.
sh/
is assorted small scripts that are useful here.
stat/
contains past statistics files of various types. It should probably be purged occasionally.
webcat
is a program that does downloads. This is a separate program because of an intractable problem: TCP connections to web servers sometimes block permanently in the connect() call. This happens on several OSs, and there seems to be no solution. So abcbot calls webcat as a subprocess. If webcat doesn't respond or exit within the timeout period, we kill it and go on. This wastes a bit of CPU time, but it solves the problem.
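
The webcat arrangement described above (run the download in a child process, kill it if it overruns a timeout) can be sketched as follows. This is a Python illustration of the general technique, not the code abcbot uses. webcat's command-line arguments are not documented here, so the command is passed in by the caller, and the function name fetch_with_timeout is made up for this example.

```python
import subprocess
import sys

def fetch_with_timeout(cmd, timeout_s):
    """Run a downloader command and return its stdout as bytes.

    Returns None if the command fails or does not finish within
    timeout_s seconds. `cmd` is an argument list; what a real
    downloader such as webcat expects is an assumption left to the caller.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising,
        # so a download stuck in connect() cannot hang the caller.
        return None
    if result.returncode != 0:
        return None
    return result.stdout

if __name__ == "__main__":
    # Stand-in for a hung download: a child that sleeps far past the timeout.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(fetch_with_timeout(hung, 1.0))
```

The cost is one extra process per download, which is the trade-off the description above accepts in exchange for never blocking the scan.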
type gif png pdf svg K keys bars staffs size M O info R W file titles
1 data---- - - - -0 - - - - --L -
2 dir---- - - - -64 - - - - -1/1
3 data---- - - - -2710 - - - - -507 -
4 ----- - - - -103242 - - - - -ABCbot-kendy* -
5 ----- - - - -95965 - - - - -ABCbot-lyrics* -
6 ----- - - - -103213 - - - - -ABCbot-minya* -
7 ----- - - - -106133 - - - - -ABCbot-trillian* -
8 ----- - - - -6297 - - - - -ABCsearch* -
9 ----- - - - -6297 - - - - -ABCsearch-trillian* -
10 ----- - - - -1387 - - - - -ABCsummary* -
11 ----- - - - -107 - - - - -Avoid* -
12 ----- - - - -245 - - - - -BadSite* -
13 ----- - - - -5721 - - - - -BadURLs* -
14 data---- - - - -1226 - - - - -CampinFiles -
15 data---- - - - -2110 - - - - -Changes -
16 ----- - - - -456 - - - - -CheckNulls* -
17 ----- - - - -3588 - - - - -Cleanup* -
18 ----- - - - -2820 - - - - -DelNulls* -
19 ----- - - - -272 - - - - -Download* -
20 data---- - - - -5260 - - - - -FranksURLs -
21 ----- - - - -1637 - - - - -GraphLog* -
22 data---- - - - -62 - - - - -Hangups -
23 data---- - - - -9040 - - - - -HdrFiles -
24 ----- - - - -220 - - - - -HostInit* -
25 ----- - - - -1559 - - - - -HostStatDiffs* -
26 data---- - - - -40141 - - - - -HostStatDiffs-20190106 -
27 data---- - - - -40361 - - - - -HostStatDiffs-20190204 -
28 data---- - - - -40728 - - - - -HostStatDiffs-20190304 -
29 data---- - - - -40502 - - - - -HostStatDiffs-20190405 -
30 data---- - - - -40290 - - - - -HostStatDiffs-20190505 -
31 data---- - - - -40581 - - - - -HostStatDiffs-20190605 -
32 data---- - - - -40368 - - - - -HostStatDiffs-20190612 -
33 data---- - - - -40438 - - - - -HostStatDiffs-20190613 -
34 data---- - - - -41023 - - - - -HostStatDiffs-20190708 -
35 data---- - - - -41020 - - - - -HostStatDiffs-20190710 -
36 data---- - - - -40503 - - - - -HostStatDiffs-20190711 -
37 data---- - - - -40505 - - - - -HostStatDiffs-20190812 -
38 data---- - - - -40637 - - - - -HostStatDiffs-20190814 -
39 data---- - - - -40062 - - - - -HostStatDiffs-20190815 -
40 data---- - - - -40133 - - - - -HostStatDiffs-20191030 -
41 data---- - - - -39924 - - - - -HostStatDiffs-20191104 -
42 data---- - - - -40138 - - - - -HostStatDiffs-20191203 -
43 data---- - - - -39785 - - - - -HostStatDiffs-20191205 -
44 data---- - - - -39653 - - - - -HostStatDiffs-20200102 -
45 data---- - - - -39657 - - - - -HostStatDiffs-20200108 -
46 data---- - - - -40296 - - - - -HostStatDiffs-20200217 -
47 data---- - - - -39928 - - - - -HostStatDiffs-20200309 -
48 data---- - - - -40645 - - - - -HostStatDiffs-20200406 -
49 data---- - - - -40138 - - - - -HostStatDiffs-20200410 -
50 data---- - - - -40077 - - - - -HostStatDiffs-20200503 -
51 data---- - - - -40072 - - - - -HostStatDiffs-20200505 -
52 data---- - - - -40072 - - - - -HostStatDiffs-20200604 -
53 data---- - - - -40077 - - - - -HostStatDiffs-20200702 -
54 data---- - - - -41129 - - - - -HostStatDiffs-20200705 -
55 data---- - - - -41490 - - - - -HostStatDiffs-20200706 -
56 data---- - - - -41563 - - - - -HostStatDiffs-20200710 -
57 data---- - - - -41323 - - - - -HostStatDiffs-20200803 -
58 data---- - - - -41557 - - - - -HostStatDiffs-20200902 -
59 data---- - - - -41258 - - - - -HostStatDiffs-20200904 -
60 data---- - - - -41396 - - - - -HostStatDiffs-20201001 -
61 data---- - - - -41396 - - - - -HostStatDiffs-20201003 -
62 data---- - - - -41396 - - - - -HostStatDiffs-20201102 -
63 data---- - - - -41541 - - - - -HostStatDiffs-20201116 -
64 data---- - - - -41889 - - - - -HostStatDiffs-20201205 -
65 data---- - - - -41669 - - - - -HostStatDiffs-20210103 -
66 data---- - - - -41949 - - - - -HostStatDiffs-20210106 -
67 data---- - - - -42094 - - - - -HostStatDiffs-20210108 -
68 data---- - - - -42019 - - - - -HostStatDiffs-20210207 -
69 data---- - - - -42170 - - - - -HostStatDiffs-20210307 -
70 ----- - - - -1512 - - - - -HostStats* -
71 data---- - - - -23740 - - - - -HostStatsData-20190106 -
72 data---- - - - -23870 - - - - -HostStatsData-20190204 -
73 data---- - - - -24000 - - - - -HostStatsData-20190304 -
74 data---- - - - -23951 - - - - -HostStatsData-20190405 -
75 data---- - - - -23535 - - - - -HostStatsData-20190505 -
76 data---- - - - -23594 - - - - -HostStatsData-20190605 -
77 data---- - - - -22309 - - - - -HostStatsData-20190612 -
78 data---- - - - -22227 - - - - -HostStatsData-20190613 -
79 data---- - - - -23990 - - - - -HostStatsData-20190708 -
80 data---- - - - -23990 - - - - -HostStatsData-20190710 -
81 data---- - - - -23949 - - - - -HostStatsData-20190711 -
82 data---- - - - -23616 - - - - -HostStatsData-20190812 -
83 data---- - - - -8300 - - - - -HostStatsData-20190814 -
84 data---- - - - -23691 - - - - -HostStatsData-20190815 -
85 data---- - - - -23732 - - - - -HostStatsData-20191030 -
86 data---- - - - -23445 - - - - -HostStatsData-20191104 -
87 data---- - - - -7763 - - - - -HostStatsData-20191203 -
88 data---- - - - -23462 - - - - -HostStatsData-20191205 -
89 data---- - - - -23418 - - - - -HostStatsData-20200102 -
90 data---- - - - -23363 - - - - -HostStatsData-20200108 -
91 data---- - - - -22178 - - - - -HostStatsData-20200217 -
92 data---- - - - -23563 - - - - -HostStatsData-20200309 -
93 data---- - - - -23066 - - - - -HostStatsData-20200406 -
94 data---- - - - -23026 - - - - -HostStatsData-20200410 -
95 data---- - - - -23078 - - - - -HostStatsData-20200503 -
96 data---- - - - -23701 - - - - -HostStatsData-20200505 -
97 data---- - - - -23662 - - - - -HostStatsData-20200604 -
98 data---- - - - -23701 - - - - -HostStatsData-20200702 -
99 data---- - - - -23871 - - - - -HostStatsData-20200705 -
100 data---- - - - -23096 - - - - -HostStatsData-20200706 -
101 data---- - - - -23139 - - - - -HostStatsData-20200710 -
102 data---- - - - -24317 - - - - -HostStatsData-20200803 -
103 data---- - - - -12046 - - - - -HostStatsData-20200902 -
104 data---- - - - -24407 - - - - -HostStatsData-20200904 -
105 data---- - - - -12040 - - - - -HostStatsData-20201001 -
106 data---- - - - -24485 - - - - -HostStatsData-20201003 -
107 data---- - - - -11323 - - - - -HostStatsData-20201102 -
108 data---- - - - -23264 - - - - -HostStatsData-20201116 -
109 data---- - - - -24260 - - - - -HostStatsData-20201205 -
110 data---- - - - -24597 - - - - -HostStatsData-20210103 -
111 data---- - - - -24753 - - - - -HostStatsData-20210106 -
112 data---- - - - -23582 - - - - -HostStatsData-20210108 -
113 data---- - - - -23539 - - - - -HostStatsData-20210207 -
114 data---- - - - -24715 - - - - -HostStatsData-20210307 -
115 ----- - - - -10802 - - - - -Hosts2html* -
116 ----- - - - -10799 - - - - -Hosts2html-kendy* -
117 ----- - - - -11142 - - - - -Hosts2html-test* -
118 ----- - - - -10406 - - - - -Hosts2html-trillian* -
119 ----- - - - -11063 - - - - -Hosts2ndx* -
120 ----- - - - -108 - - - - -Ignore* -
121 ----- - - - -168 - - - - -IgnoreSite* -
122 ----- - - - -220 - - - - -InitHosts* -
123 data---- - - - -85 - - - - -J -
124 ----- - - - -257 - - - - -KillSite* -
125 data---- - - - -397 - - - - -KnownProblems -
126 data---- - - - -185 - - - - -L -
127 ----- - - - -422 - - - - -Lfind* -
128 ----- - - - -165 - - - - -Lidh* -
129 ----- - - - -144 - - - - -ListNewSites* -
130 ----- - - - -4215 - - - - -ListSplit* -
131 ----- - - - -5436 - - - - -Ln* -
132 data---- - - - -7004 - - - - -Makefile -
133 ----- - - - -459 - - - - -NTT* -
134 ----- - - - -48 - - - - -NewSites* -
135 ----- - - - -177 - - - - -NewTunes* -
136 ----- - - - -4300 - - - - -NewURL* -
137 ----- - - - -35 - - - - -NewURLs* -
138 ----- - - - -2112 - - - - -NewUpdateHost* -
139 ----- - - - -122 - - - - -Nils* -
140 data---- - - - -36760 - - - - -Scan2004 -
141 data---- - - - -310409 - - - - -Scan2005 -
142 data---- - - - -471227 - - - - -Scan2006 -
143 data---- - - - -439322 - - - - -Scan2007 -
144 data---- - - - -489086 - - - - -Scan2008 -
145 data---- - - - -425306 - - - - -Scan2009 -
146 data---- - - - -437104 - - - - -Scan2011 -
147 data---- - - - -539663 - - - - -Scan2012 -
148 data---- - - - -499467 - - - - -Scan2013 -
149 data---- - - - -507681 - - - - -Scan2014 -
150 data---- - - - -383566 - - - - -Scan2015 -
151 data---- - - - -375289 - - - - -Scan2016 -
152 data---- - - - -388394 - - - - -Scan2017 -
153 data---- - - - -265729 - - - - -Scan2018 -
154 data---- - - - -211256 - - - - -Scan2019 -
155 data---- - - - -380712 - - - - -Scan2020 -
156 ----- - - - -1476 - - - - -ScanCount* -
157 ----- - - - -186 - - - - -SlowSites* -
158 ----- - - - -1862 - - - - -SummaryDiffs* -
159 ----- - - - -2068 - - - - -TT* -
160 ----- - - - -400 - - - - -TTcount* -
161 data---- - - - -513 - - - - -TestSearch1 -
162 data---- - - - -4462 - - - - -ToDo -
163 ----- - - - -3237 - - - - -TuneSearch* -
164 ----- - - - -582 - - - - -TuneSearch1* -
165 ----- - - - -2497 - - - - -TuneURLs* -
166 ----- - - - -249 - - - - -U* -
167 ----- - - - -2707 - - - - -UH* -
168 ----- - - - -203 - - - - -UP* -
169 ----- - - - -4300 - - - - -URL* -
170 data---- - - - -71167 - - - - -URLs -
171 data---- - - - -31765 - - - - -URLs-kendy -
172 ----- - - - -137 - - - - -Uadd* -
173 ----- - - - -226 - - - - -Uhst* -
174 ----- - - - -13542 - - - - -UpdateAllHosts* -
175 ----- - - - -2707 - - - - -UpdateHost* -
176 ----- - - - -2707 - - - - -UpdateHosts* -
177 data---- - - - -745148 - - - - -ZIPfiles -
178 ----- - - - -130599 - - - - -abcbot* -
179 ----- - - - -130599 - - - - -abcbot-20210328* -
180 ----- - - - -130609 - - - - -abcbot-20211024* -
181 ----- - - - -57428 - - - - -abccat* -
182 ----- - - - -102 - - - - -abccount* -
183 ----- - - - -15605 - - - - -abcextract* -
184 ----- - - - -27269 - - - - -abcinhtml* -
185 dir---- - - - -64 - - - - -add/add
186 dir---- - - - -64 - - - - -all/all
187 data---- - - - -253 - - - - -badfiles -
188 ----- - - - -58 - - - - -badsites* -
189 dir---- - - - -256 - - - - -bin/bin
190 data---- - - - -0 - - - - -bot -
191 dir---- - - - -96 - - - - -cfg/cfg
192 ----- - - - -115 - - - - -checkrun* -
193 data---- - - - -1510 - - - - -count -
194 dir---- - - - -320 - - - - -data/data
195 dir---- - - - -64 - - - - -del/del
196 dir---- - - - -64 - - - - -doc/doc
197 ----- - - - -663 - - - - -docx2txt* -
198 dir---- - - - -64 - - - - -event/event
199 ----- - - - -396 - - - - -ffind* -
200 ----- - - - -627 - - - - -findrobotstxt* -
201 dir---- - - - -128 - - - - -gavving/gavving
202 ----- - - - -221 - - - - -gmtime* -
203 ----- - - - -184 - - - - -grepbot* -
204 ----- - - - -510 - - - - -greplogs* -
205 ----- - - - -93 - - - - -greptunes* -
206 ----- - - - -158 - - - - -gsl* -
207 ----- - - - -93 - - - - -gt* -
208 ----- - - - -1884 - - - - -hrestore* -
209 dir---- - - - -64 - - - - -hst/hst
210 ----- - - - -88 - - - - -ht* -
211 ----- - - - -16314 - - - - -htmltext* -
212 ----- - - - -13930 - - - - -httpTuneInfo* -
213 ----- - - - -30072 - - - - -httpcat* -
214 ----- - - - -30072 - - - - -httpcat-trillian* -
215 ----- - - - -30072 - - - - -httpget* -
216 ----- - - - -175 - - - - -httptest* -
217 ----- - - - -106 - - - - -httpurge* -
218 ----- - - - -27193 - - - - -hzcat* -
219 dir---- - - - -128 - - - - -kendy/kendy
220 ----- - - - -1305 - - - - -l2h* -
221 dir---- - - - -96 - - - - -last/last
222 dir---- - - - -64 - - - - -lck/lck
223 dir---- - - - -64 - - - - -lib/lib
224 ----- - - - -58 - - - - -listbadsites* -
225 dir---- - - - -64 - - - - -log/log
226 ----- - - - -2048 - - - - -lynxcat* -
227 dir---- - - - -160 - - - - -minya/minya
228 dir---- - - - -160 - - - - -mirror/mirror
229 ----- - - - -290 - - - - -mklast* -
230 dir---- - - - -64 - - - - -ndx/ndx
231 dir---- - - - -64 - - - - -new/new
232 dir---- - - - -64 - - - - -nul/nul
233 dir---- - - - -64 - - - - -old/old
234 data---- - - - -244 - - - - -patterns -
235 data---- - - - -3333 - - - - -pf -
236 dir---- - - - -64 - - - - -pm/pm
237 dir---- - - - -64 - - - - -program/program
238 dir---- - - - -288 - - - - -proto/proto
239 ----- - - - -9726 - - - - -purgehost* -
240 ----- - - - -9735 - - - - -purgehst* -
241 data---- - - - -11603 - - - - -rc -
242 ----- - - - -2175 - - - - -relinkdirs* -
243 ----- - - - -1884 - - - - -restorehst* -
244 ----- - - - -459 - - - - -rmTT* -
245 ----- - - - -59 - - - - -rmwc* -
246 dir---- - - - -96 - - - - -rscds/rscds
247 ----- - - - -2880 - - - - -rsdir* -
248 ----- - - - -1719 - - - - -rshost* -
249 ----- - - - -1719 - - - - -rssite* -
250 dir---- - - - -3552 - - - - -save/save
251 ----- - - - -13812 - - - - -scanHosts* -
252 data---- - - - -82957 - - - - -scandata -
253 dir---- - - - -4000 - - - - -sh/sh
254 ----- - - - -1656 - - - - -showbadsites* -
255 data---- - - - -0 - - - - -start2 -
256 data---- - - - -0 - - - - -start3 -
257 dir---- - - - -18624 - - - - -stat/stat
258 data---- - - - -3253 - - - - -stats -
259 ----- - - - -62 - - - - -taildupl* -
260 dir---- - - - -96 - - - - -test/test
261 data---- - - - -117 - - - - -testU -
262 ----- - - - -126047 - - - - -testbot* -
263 data---- - - - -50162 - - - - -testcat -
264 ----- - - - -48464 - - - - -testwebcat* -
265 dir---- - - - -64 - - - - -tmp/tmp
266 ----- - - - -61 - - - - -todel* -
267 ----- - - - -11 - - - - -tonew* -
268 data---- - - - -11771 - - - - -tqqq -
269 ----- - - - -23783 - - - - -trabc* -
270 dir---- - - - -256 - - - - -trillian/trillian
271 ----- - - - -34 - - - - -ts* -
272 ----- - - - -221 - - - - -ut* -
273 ----- - - - -154 - - - - -utf8test* -
274 ----- - - - -28445 - - - - -w3cat* -
275 data---- - - - -2879 - - - - -wc -
276 ----- - - - -53228 - - - - -webcat* -
277 ----- - - - -75 - - - - -zapdir* -