[00:00:00] New patchset: Nemo bis; "(bug 46589) Add localised/v2 logos for Wikipedias without one (second installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56084 [00:08:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 00:08:47 UTC 2013 [00:08:56] RECOVERY - Puppet freshness on db1052 is OK: puppet ran at Wed Mar 27 00:08:51 UTC 2013 [00:09:45] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:10:24] RECOVERY - Puppet freshness on mw13 is OK: puppet ran at Wed Mar 27 00:10:21 UTC 2013 [00:10:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 00:10:46 UTC 2013 [00:11:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:12:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 00:12:35 UTC 2013 [00:12:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:14:34] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 00:14:30 UTC 2013 [00:14:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:20:05] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [00:27:37] New review: Asher; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [00:29:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56081 [00:30:54] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1001, adding db1052 at a warmup weight' [00:31:01] Logged the message, Master [00:34:24] !log asher synchronized wmf-config/db-eqiad.php 'db1052 to full weight' [00:34:31] Logged the message, Master [00:39:51] PROBLEM - RAID on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:21] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:21] PROBLEM - swift-account-auditor on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:31] PROBLEM - SSH on ms-be3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:31] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:31] PROBLEM - swift-container-updater on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:31] PROBLEM - DPKG on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:31] PROBLEM - swift-object-replicator on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:32] PROBLEM - swift-object-updater on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:32] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:51] PROBLEM - swift-object-server on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:51] PROBLEM - swift-account-replicator on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:01] PROBLEM - swift-container-server on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:11] PROBLEM - swift-account-server on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:11] PROBLEM - swift-object-auditor on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:52:35] grrr i look at ms-be3 [00:53:16] yep, looks ike swift crashed [00:54:02] !log rebooting ms-be3 [00:54:09] Logged the message, Mistress of the network gear. [00:54:31] PROBLEM - NTP on ms-be3 is CRITICAL: NTP CRITICAL: No response from NTP server [00:55:21] !log ms-be3 was crashed with stack traces for swift-container [00:55:27] Logged the message, Mistress of the network gear. [00:56:41] RECOVERY - swift-account-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [00:56:41] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:56:51] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [00:57:02] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [00:57:12] RECOVERY - swift-object-auditor on ms-be3 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [00:57:12] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:57:12] RECOVERY - swift-account-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [00:57:21] RECOVERY - SSH on ms-be3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:57:21] RECOVERY - swift-container-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [00:57:21] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [00:57:21] RECOVERY - DPKG on ms-be3 is OK: All packages OK [00:57:31] RECOVERY - swift-object-updater on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [00:57:31] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [00:57:31] RECOVERY - swift-object-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [01:06:19] New patchset: Nemo bis; "(bug 46589) Add localised/v2 logos for Wikipedias without one (second installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56084 [01:08:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [01:10:14] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [01:14:11] New patchset: Nemo bis; "(bug 46589) Add localised/v2 logos for Wikipedias without one (second installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56084 [01:16:00] Change abandoned: Nemo bis; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56084 [01:31:16] New patchset: Nemo bis; "(bug 46589) Add localised/v2 logos for Wikipedias without one (second installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56097 [01:36:10] New patchset: Odder; "Add tz database time zone settings for wikis in Maldivian language Adding tz database time zone settings for dv.wikipedia and dv.wiktionary ('Indian/Maldives' = UTC+5:00). 
Bug: 46351" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56098 [01:48:46] New patchset: Odder; "Add tz database time zone settings for wikis in Maldivian language. Removing unnecessary comma from the end of the line Bug: 46351" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56098 [02:04:57] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [02:07:08] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [02:08:37] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours [02:09:39] PROBLEM - Puppet freshness on mw1025 is CRITICAL: Puppet has not run in the last 10 hours [02:09:39] PROBLEM - Puppet freshness on mw1121 is CRITICAL: Puppet has not run in the last 10 hours [02:10:01] New review: Odder; "All the new links are working; the files have also been protected on Commons against moves and reupl..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56097 [02:10:37] PROBLEM - Puppet freshness on mw1130 is CRITICAL: Puppet has not run in the last 10 hours [02:14:37] PROBLEM - Puppet freshness on mw1120 is CRITICAL: Puppet has not run in the last 10 hours [02:15:37] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours [02:19:11] !log LocalisationUpdate completed (1.21wmf12) at Wed Mar 27 02:19:11 UTC 2013 [02:19:18] Logged the message, Master [02:23:39] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [02:44:56] !log LocalisationUpdate completed (1.21wmf11) at Wed Mar 27 02:44:56 UTC 2013 [02:45:02] Logged the message, Master [02:47:22] wikibugs has gone quiet. [02:47:37] And there are reports of Gerrit e-mail strangeness. [02:50:01] Hmm, gerrit-wm seems fine in #mediawiki. [03:03:27] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [03:05:37] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [03:20:35] New patchset: Odder; "(bug 45066) Disable anonymous page creation at tr.wikipedia Disabling page creation for anonymous users on the Turkish Wikipedia per community consensus. Bug: 45066" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56101 [03:42:32] New review: Odder; "Please set $wgAutoConfirmAge = 345600 per https://bugzilla.wikimedia.org/show_bug.cgi?id=44285#c3 an..." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56055 [04:06:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:07:54] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [04:08:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:07:55 UTC 2013 [04:08:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:09:15] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:09:11 UTC 2013 [04:09:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:10:24] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:10:23 UTC 2013 [04:10:47] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:11:24] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:11:20 UTC 2013 [04:11:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:12:14] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:12:13 UTC 2013 [04:12:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:13:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:12:58 UTC 2013 [04:13:45] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:16:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 04:16:43 UTC 2013 [04:17:45] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:51:07] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [04:55:33] !log purging BayesLearning files older than 70 days on mchenry [04:55:39] Logged the message, Master [04:56:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [05:01:57] !log disabling Bayes-collector cron on sanger [05:02:03] Logged the message, Master [05:15:08] New patchset: Tim Starling; "Add a cron job to clean up old MW logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55003 [05:19:59] !log on fluorine: running mw-log-cleanup once to test Id2196e7b [05:20:05] Logged the message, Master [05:25:01] New patchset: Tim Starling; "Add a cron job to clean up old MW logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55003 [05:26:53] New review: Tim Starling; "PS3: set executable bit on script in git" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/55003 [05:26:55] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55003 [05:33:49] New patchset: Tim Starling; "Maybe not once per minute from 02:00 to 02:59" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56103 [05:35:07] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56103 [06:04:17] New patchset: Tim Starling; "Move scap source location from fenari to tin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56104 [06:04:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:06:58] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [06:10:46] New patchset: Tim Starling; "In sync-dir, actually perform the syntax check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56105 [06:31:30] RECOVERY - Puppet freshness 
on db11 is OK: puppet ran at Wed Mar 27 06:31:24 UTC 2013 [06:31:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:49:45] New patchset: Tim Starling; "Basic puppetization of dsh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56107 [06:49:47] New patchset: Tim Starling; "Remove some node lists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56108 [07:04:21] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [07:06:31] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [07:10:41] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:14:41] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 07:14:31 UTC 2013 [07:15:21] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:06:38] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:07:48] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 08:07:42 UTC 2013 [08:08:38] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:08:49] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [08:10:48] PROBLEM - Puppet freshness on mw1095 is CRITICAL: Puppet has not run in the last 10 hours [08:14:38] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 08:14:30 UTC 2013 [08:14:38] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:23:48] PROBLEM - Puppet freshness on mw62 is CRITICAL: Puppet has not run in the last 10 hours [08:24:48] PROBLEM - Puppet freshness on mw1135 is CRITICAL: Puppet has not run in the last 10 hours [08:35:07] PROBLEM - Puppet freshness on mw51 is CRITICAL: Puppet has not run in the last 10 hours [09:05:12] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [09:06:22] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [09:31:46] New patchset: Odder; "(bug 43863) Enabled wgImportSources on the Spanish Wikivoyage. Added eswiki, meta, commons, en.voy, de.voy, fr.voy, it.voy, nl.voy, pt.voy, ru.voy, and sv.voy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56113 [09:36:59] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [09:36:59] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [09:36:59] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [09:54:40] wb Nikerabbit [09:55:50] Nemo_bis: uga [09:56:17] !log Testing Translate bug fixes on test.wikipedia.org [09:56:25] Logged the message, Master [10:05:07] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [10:07:18] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [10:08:17] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours [10:14:03] New review: Nemo bis; "Leslie, what's stopping this? Apart from moving to hume as you said, do I need to do something else?" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [10:16:18] Nikerabbit: wikiapiary compensates the lack of WMF monitoring in part, though http://wikiapiary.com/wiki/Wikipedia_Test_Wiki [10:17:17] with low resolution though [10:20:17] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [10:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:04] New patchset: Odder; "(bug 45638) Modify user group rights on it.wikivoyage Modified wgAddGroups and wgRemoveGroups; changed user rights for autoconfirmed, added patroller group." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56118 [10:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:04:27] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [11:06:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:38] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [11:07:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [11:13:59] ugh [11:14:10] change propagation lag on wikidata is rising fast [11:14:33] anyone around to approve and deploy https://gerrit.wikimedia.org/r/#/c/55904/ ? [11:15:28] uga [11:24:20] hi hashar [11:24:26] lo [11:33:37] it looks like these jobs are expected to overlap, am I understanding that right? how does that work? [11:34:09] are there any stats what is the waiting time for jobs to be run? [11:35:24] I guess I need to understand more about what happens once they get to the dispatcher log [11:39:07] apergos: yes, correct [11:39:39] apergos: piping to the logs is indeed a problem though :/ [11:41:03] so if they can run for 900 (I guess seconds) and they run every 5 minutes (= 300 seconds) isn't that going to be a problem with multiple invocations trying to write at the same time? [11:43:39] apergos: write where? [11:44:10] apergos: there is no problem on the database level. the dispatcher is specifically designed to run multiple instances in parallel. [11:44:12] into /var/log/wikidata/dispatcher*.log [11:44:24] that depends on how the OS handles pipes, i guess [11:44:29] that *might* be a problem [11:44:38] as in: some log lines getting lost [11:44:45] but nothing critical [11:44:58] i'm trying to test this locally. aude said it works fine on her setup [11:45:33] well you would want to test it with cases where you have overlap [11:45:54] or just run two of them with the redirect in two different sessions and see what it does [11:46:36] I expect you are going to have some garbled log entries; do you use these for anything? (if not maybe you want to redirect them to /dev/null instead) [11:47:05] New patchset: Odder; "(bug 46182) Set LQT as opt-out on se.wikimedia (chapter wiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56129 [11:47:05] New patchset: Odder; "(bug 46182) Set LQT as opt-out on se.wikimedia (chapter wiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130 [11:47:26] hi [11:47:32] apergos: we need the logs for now, yes. they tell us how, why and where stuff is lagging [11:47:46] apergos: some lines may get garbled. 
i'm not too worried about that [11:47:52] I see [11:47:54] it's not like anything is absolutely relying on the, [11:47:57] *them [11:48:01] so no one is running stats off of them or anything [11:48:07] a quick test on my local box shows no issues [11:48:08] Change abandoned: Odder; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130 [11:48:12] apergos: right, no stats [11:48:16] ok [11:48:17] apergos: no, we do stats off the database directly [11:48:19] just to see if something is going wrong [11:49:10] * aude has in my crontab [11:49:11] /usr/local/bin/mwscript extensions/Wikibase/lib/maintenance/dispatchChanges.php --wiki enwikidata --max-time 900 2>&1 >> /var/log/wikibase/dispatcher.log [11:49:29] same way done on hume, except log file might be in a different place [11:49:49] every 5 min (and i have 2 cronjobs) [11:50:01] * aude not stress testing it though, but needs to do that [11:50:02] Change restored: Odder; "Proper version with bug number." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130 [11:50:09] apergos: is there a way to run jobqueue manually for a wiki? [11:50:14] Change abandoned: Odder; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56129 [11:50:35] there is a way to run the next set of jobs for a given wiki yes [11:50:58] it didn't merge it [11:52:11] I wonder what's changed in the workflow now [11:53:37] it's verified by jenkins, +2 by me, what more can it want? [11:57:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55904 [11:57:33] tried again and it took it [11:57:34] weird [11:57:56] thanks apergos [11:58:22] live next time puppet runs [11:58:24] yay! [11:59:14] apergos: thanks! [12:01:18] apergos: once it has been running for a few minutes, could you give us a tail of the log? [12:01:37] i want to see whether it scales like I expect it to [12:03:18] on hume? [12:03:53] New review: Hashar; "Annnnnd this is a bug in puppet :( See https://projects.puppetlabs.com/issues/2053#note-18" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54970 [12:04:02] and which log do you want, dispatcher or dispatcher2? [12:04:33] DanielK_WMDE: [12:06:18] apergos: and the way is? [12:06:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:06:52] huh?
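
(A minimal sketch of the dispatcher cron job discussed above, written as a Puppet cron resource; the resource title is invented and user/environment attributes are omitted, so this is not the contents of the merged change 55904. The command is aude's quoted crontab line with one adjustment: with the original "2>&1 >> file" ordering, stderr still goes to cron's own stdout rather than the log, so the conventional ">> file 2>&1" is used here. Overlapping runs are intentional per the discussion, since the dispatcher is designed to run several instances in parallel.)

    # Sketch only -- illustrative, not production code.
    cron { 'wikidata-dispatcher':
        ensure  => present,
        minute  => '*/5',   # every 5 minutes; runs overlap because --max-time is 900s
        command => '/usr/local/bin/mwscript extensions/Wikibase/lib/maintenance/dispatchChanges.php --wiki enwikidata --max-time 900 >> /var/log/wikibase/dispatcher.log 2>&1',
    }
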
[12:07:58] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [12:08:38] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:08:29 UTC 2013 [12:08:49] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours [12:08:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:09:00] Nikerabbit: I didn't get your question [12:09:48] PROBLEM - Puppet freshness on mw1025 is CRITICAL: Puppet has not run in the last 10 hours [12:09:48] PROBLEM - Puppet freshness on mw1121 is CRITICAL: Puppet has not run in the last 10 hours [12:10:48] PROBLEM - Puppet freshness on mw1130 is CRITICAL: Puppet has not run in the last 10 hours [12:11:08] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:10:58 UTC 2013 [12:11:49] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:13:39] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:13:32 UTC 2013 [12:13:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:14:37] apergos: both logs, please [12:14:48] just put them on a pastebin [12:14:48] i'm off for lunch [12:14:48] thanks! [12:14:48] PROBLEM - Puppet freshness on mw1120 is CRITICAL: Puppet has not run in the last 10 hours [12:14:48] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:14:41 UTC 2013 [12:15:48] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours [12:15:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:15:58] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:15:48 UTC 2013 [12:16:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:16:48] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:16:47 UTC 2013 [12:17:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:18:28] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 12:18:20 UTC 2013 [12:18:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:20:53] apergos: how to run jobs for testwiki? [12:21:17] oh. sorry! [12:21:32] you need to be on a jobqueue host... oh do you mean on a local instance? [12:21:43] apergos: test.wikipedia.org [12:22:14] hmm probably still need to be on a jobqueue host [12:22:22] just a sec I'll find the command [12:23:48] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [12:23:49] DanielK_WMDE: http://p.defau.lt/?Ob9DpqN_O0TR9JEZHgX4pw [12:23:56] ok now lemme find the job queue thing [12:25:38] php MWScript.php runJobs.php --wiki="$db" --procs="$forkcount" --type="$type" --maxtime=$hpmaxtime this is how it gets run on our hosts [12:27:56] types should be one of sendMail enotifNotify uploadFromUrl MoodBarHTMLMailerJob ArticleFeedbackv5MailerJob RenderJob [12:28:13] apergos: and how do I find what is a jobqueue host (which I can access?) [12:28:17] I think if you don't specify then it's refreshLinks2 or something [12:28:28] really? those are just the priority types [12:29:04] the jobqueue hosts have the role::applicationserver::jobrunner stanza in site.pp in the puppet repo [12:29:27] test.wp is a dedicated server still isn't it? [12:29:37] you could try running from there, it might work [12:29:50] apergos: and you vouch it wont break anything? 
;) [12:29:51] assuming you're on it I mean [12:30:00] I don't vouch anything at all [12:30:14] just don't give it a large number of procs, give it like 2 [12:30:44] but really, I known only a little and the docs are scarce... I'm not sure if testwp is actually running on fenari if the code is there [12:30:58] no it's not [12:34:50] srv193 [12:35:33] the squid settings have that, which are /home/w/conf/squid [12:36:05] and more specifically text-settings.php [12:36:10] oh [12:36:17] (on fenari) [12:36:26] it's a hack, remember [12:37:05] PROBLEM - RAID on labstore1 is CRITICAL: CRITICAL: Partially Degraded [12:39:34] Cannot run a MediaWiki script as a user in the group wikidev [12:39:47] fair enough but what user should it be? [12:41:33] Nikerabbit: mwdeploy I guess [12:42:43] the message should give instructions i guess [12:42:44] or even attempt to sudo [12:43:31] !g I033acd9132e47b840c76da995d14c92e8376775d | Nikerabbit [12:43:31] Nikerabbit: https://gerrit.wikimedia.org/r/#q,I033acd9132e47b840c76da995d14c92e8376775d,n,z [12:43:53] Nikerabbit: that is the change, apparently you should even sudo -u apache [12:44:35] ah sorry (I was coding in another window) [12:53:10] hashar: cool, though commit message is weird place for documentation [12:59:18] Nikerabbit: sending a change :-] [13:04:11] Nikerabbit: isn't that the reason why you suggested to make a search for commits, docs etc. all together? :) [13:04:23] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [13:06:33] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [13:08:04] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54362 [13:08:17] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54363 [13:08:29] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54683 [13:08:53] New patchset: Hashar; "usage help when mwscript is not run as `apache` user" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56146 [13:09:09] New review: Hashar; "Follow up in https://gerrit.wikimedia.org/r/56146 which improves the error message." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44200 [13:09:12] Nikerabbit: https://gerrit.wikimedia.org/r/56146 :-) [13:11:58] apergos: hm, that looks like the log from a single process, not 3 processes per file. i guess we should add the PID to the output. [13:12:12] can you cive me more lines? say, 500 from each file? [13:14:44] http://p.defau.lt/?Nt0Y8zFul7tR05v5cWzg_A I can give you what my scrollback has [13:14:45] here's one [13:15:40] http://p.defau.lt/?tNCmGeGxlrFulzc8IOZimA here;;s two [13:21:28] New patchset: Matthias Mullie; "on frwiki, show CTA4 (signup or login) for 100%" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56149 [13:23:03] New patchset: Odder; "(bug 46461) Set $wgAutoConfirmCount to 50 for Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56150 [13:23:16] apergos: that looks a lot better, thanks [13:23:26] sure [13:41:32] New patchset: Aude; "Remove wikidata.org from CORS, keep only *.wikidata.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56153 [13:42:55] New review: Aude; "please also see https://gerrit.wikimedia.org/r/#/c/49069/ (and appreciate review of that patch) to m..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56153 [13:57:58] !log Jenkins: disabled the old Gerrit Trigger Plugin {{bug|46415}} [13:58:06] Logged the message, Master [14:05:35] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [14:07:45] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [14:09:10] !log jenkins has been restarted by mistake :-( Job building will be unavailable for up to half an hour. [14:09:16] Logged the message, Master [14:11:27] New patchset: Ottomata; "Syncing edit logs from gadolinium instead of locke" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56158 [14:12:29] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56158 [14:12:47] Change abandoned: Matthias Mullie; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54807 [14:13:58] New patchset: Demon; "Properly configure hooks-bugzilla plugin based on feedback" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54989 [14:16:32] Change abandoned: Demon; "Squashed this into I1219d6ab." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55048 [14:17:03] New review: Demon; "Ignore PS1, just review PS2." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54989 [14:17:16] New patchset: Matthias Mullie; "Update frwiki config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55946 [14:18:49] New review: Demon; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54989 [14:21:49] Change abandoned: Matthias Mullie; "Has been folded into https://gerrit.wikimedia.org/r/#/c/55946/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56149 [14:28:41] New patchset: Matthias Mullie; "Update frwiki config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55946 [14:29:44] New patchset: Ottomata; "Now using varnish hostnames to filter for mobile logs. Now syncing mobile logs from gadolinium to stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56161 [14:29:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56161 [14:35:17] New patchset: Ottomata; "Changing minute on misc::statistics::rsync_job to use fqdn_rand to spread out rsync jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56162 [14:35:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56162 [14:48:22] New patchset: Ottomata; "Undoing the last change. fqdn_rand generates the same number on the same host (duh.)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56164 [14:49:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56164 [14:51:54] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [14:56:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [15:04:12] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [15:05:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: be_x_oldwiki to 1.21wmf12 [15:05:11] Logged the message, Master [15:06:23] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [15:36:54] can someone help..trying to deploy mw1209+ but running into rsync issues. puppet runs error free. 
added to site.pp and all dsh/node_groups..set to false in pybal [15:36:55] http://p.defau.lt/?lGV0Vkj1HN6G8whoaozYVg [15:37:40] there's been a few lock file errors like that around for a few days [15:37:41] reedy@fenari:/home/wikipedia/common$ ls -al php-1.21wmf11/.git/modules/extensions/MWSearch/index.lock [15:37:41] -rwxrw---- 1 catrope wikidev 0 Mar 20 23:58 php-1.21wmf11/.git/modules/extensions/MWSearch/index.lock [15:38:00] reedy: i understand the lock files to not be an issue [15:38:17] What is then? [15:38:19] it is this sync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [generator=3.0.9] [15:38:34] Isn't that just a result of the above errors? [15:38:36] at least that is what I was told by notpeter [15:38:48] live-1.5 was a folder, but it's now a symlink (as of last night) [15:40:40] The files on disk look sensible [15:41:24] reedy: yes...it looks like it is fetching now. [15:41:26] thx [15:41:42] reedy@fenari:/home/wikipedia/common$ apache-fast-test test.txt mw1209 mw1210 [15:41:42] testing 1 urls on 2 servers, totalling 2 requests [15:41:42] spawning threads... [15:41:42] http://en.wikipedia.org/wiki/Main_Page [15:41:42] * 200 OK 62443 [15:47:45] paravoid, mind if I poke again for a review of this: [15:47:46] https://gerrit.wikimedia.org/r/#/c/49710/ [15:47:46] ? [15:47:50] (hope you don't mind, cause I just poked!) [15:47:55] I don't mind :) [15:47:57] looking [15:48:07] PS19? [15:48:07] this is something that is on the analytics work in progress features mingle bla bla [15:48:09] lol! [15:48:11] so they ask me abou tit every day [15:48:13] haha [15:48:15] poor ottomata [15:49:02] if you want to see the inline comments, they are on patchset 13 [15:49:16] but I responded to the questionable ones in the main comment for 19 [15:51:47] New patchset: Hashar; "0.6.1-2 gbp.conf and tweaks" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56168 [15:52:12] PROBLEM - profiler-to-carbon on professor is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [15:53:09] RECOVERY - profiler-to-carbon on professor is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [15:54:16] so, [15:54:36] I think moving upstart_job to a module is a good thing [15:54:44] even though it's a bit too much to ask of you for limn :) [15:55:03] well, if it were to be just moving the current define to a module, that's easy [15:55:04] how's that upstart module you mentioned? [15:55:06] is it any good? [15:55:10] complicated, looks pretty compllete [15:55:23] https://github.com/bison/puppet-upstart [15:55:40] oh that's for defining jobs too [15:55:59] oh my debian favorites :-D Got you a patch for the python-voluptuous debian package. https://gerrit.wikimedia.org/r/#/c/56168/ :] [15:56:00] yeah, via parameters, which is good, but probably won't cover all cases [15:56:05] added you both on review. [15:56:38] :) [15:56:57] paravoid: Zuul is not going to be packaged anytime soon. It needs a few more dependencies :] [15:58:13] hashar: oh hah, just saw your mails to debian-python [15:59:04] paravoid: and the good news are that OpenStack infrastructure people are really willing to package. [15:59:14] cool [15:59:19] I'd be willing to sponsor you btw [15:59:28] you don't need to go through mentors or zigo [15:59:37] New review: Lcarr; "Please rebase this on top of the recent nagios.pp -> misc/icinga.pp moves and then we should be good..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [15:59:43] yeah still have to have a look at the debian process to get a package submitted. Then I will know what a sponsor is :-] [15:59:53] but to sponsor it I'd like to see it under the python modules repos and processes [16:00:11] a sponsor is someone who reviews the package for you and ultimately uploads it into Debian [16:00:34] so my next step is to get the package hosted on alioth / debian-python svn [16:00:38] yep [16:00:46] although I hate svn nowadays, I guess I have to adapt [16:01:07] New review: Jeremyb; "seems technically fine but still needs some policy decisions." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56101 [16:03:22] New patchset: Aude; "Enable Wikidata data transclusion for some clients" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56169 [16:04:44] New review: Aude; "we would like https://gerrit.wikimedia.org/r/#/c/56165/ merged and deployed first" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56169 [16:04:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:07:00] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [16:07:54] ottomata: so the upstart module doesn't what upstart_job does [16:08:11] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:08:02 UTC 2013 [16:08:17] paravoid, no? [16:08:51] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:09:05] upstart::job installs the /etc/init file, i see that it also manages the service…which is ok [16:09:06] i guess [16:09:11] i think i'd rather it didn't [16:09:18] that's not what upstart_job does :) [16:09:30] upstart job installs the /etc/init file, no? [16:09:35] upstart_job* [16:09:40] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:09:34 UTC 2013 [16:09:40] well, not that just that [16:09:46] not just that even [16:09:47] gah [16:09:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:10:27] that and symlinks /etc/init.d to upstart-job [16:11:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:10:51 UTC 2013 [16:11:23] yeah [16:11:31] anything else? [16:11:34] I don't see why it does exec and not service after that [16:11:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:12:10] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:12:01 UTC 2013 [16:12:11] why ours? 
[16:12:15] yeah i'm not sure either [16:12:22] New patchset: Mark Bergsma; "Add cp3004 to the esams upload pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56171 [16:12:27] its also weird that $install and $start are "false" [16:12:34] by default [16:12:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:13:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:12:59 UTC 2013 [16:13:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56171 [16:13:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:14:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:13:50 UTC 2013 [16:14:22] New patchset: Hashar; "Jenkins test (DO NOT SUBMIT)" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56172 [16:14:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:15:20] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Wed Mar 27 16:15:17 UTC 2013 [16:15:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:19:37] New patchset: Hashar; "Jenkins test (DO NOT SUBMIT)." [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56172 [16:20:51] off for now [16:27:29] New review: awjrichards; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [16:46:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55301 [16:51:35] paravoid, sorry, back, I was grabbing lu ch [16:51:35] lunch [16:51:35] so what should I do with the limn puppet stuff? [16:51:47] New review: Dzahn; "root@neon:~# /usr/lib/nagios/plugins/check_http -H 'search1023' -p 8123 -url='/status/' --regex FAIL..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [16:53:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56146 [16:54:34] New review: Dzahn; "see above in my last comment, am i using the wrong search boxes or does it not work yet? the ones i ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [17:03:14] New review: Daniel Kinzler; "Do we somewhere also force allowDataTransclusion to false for all the other wikis? It's true per def..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56169 [17:06:23] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [17:07:33] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [17:11:24] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:14:49] New patchset: Mattflaschen; "Add labs redis subclassing the main one and setting directory." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [17:16:44] New patchset: Aaron Schulz; "Cleanup high priority jobs listing." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56179 [17:18:43] New patchset: Jeremyb; "(bug 45066) Disable anonymous page creation at tr.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56101 [17:19:26] New review: Jeremyb; "carry forward -1 for shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56101 [17:19:33] notpeter: https://gerrit.wikimedia.org/r/#/c/56179/1 [17:20:44] AaronSchulz: looks suspicious [17:21:00] possibly also photoshopped [17:21:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56179 [17:21:53] ok, merged on sockpuppet [17:28:12] mark, hi, do you want to discuss our varnish mobile ACL issue at some point? Would love to get some direction while i'm in SF [17:28:56] or who would be the best varnish discussion contact? [17:29:12] Change abandoned: Demon; "Not going to do this. Certainly not this way." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53173 [17:29:13] use ops@? :) [17:30:03] well, i was hoping to get some direction perspective before asking intelligent questions :) [17:30:09] New patchset: RobH; "removing /home from terbium, death to nfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56182 [17:30:53] also, we are looking for some help to set up a varnish test env [17:31:00] yurik: apt-get install varnish [17:31:01] :D [17:31:15] I'm sure others would be interested in those directions too, both in and out of the ops team [17:31:20] i knew it was easy! :) but i was hoping to mimic our prod env [17:31:56] goals being: 1) long list of ACLs performance impact 2) unit testing our config changes [17:32:28] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56182 [17:32:31] I profiled the acl matches [17:32:39] to see their impact [17:33:04] did extensive testing the other night, I couldn't sleep :) [17:33:15] :) what's the verdict? [17:33:44] the original was 20-25% slower than the revised one [17:34:05] or maybe more [17:34:11] I should fetch the numbers again, I keep forgetting [17:34:35] 41.20 2.94 2.94 100000000 29.41 34.39 match_acl_named_beeline_orig [17:34:38] 33.35 5.32 2.38 100000000 23.81 28.79 match_acl_named_beeline_new [17:34:48] 23.81 vs. 29.41 ns per call (on my system) [17:35:27] paravoid, i would think the "unmatched ip" is the worst case scenario [17:35:33] no it's not [17:35:49] well, it's not for testing a single ACL [17:36:02] why not, it would go through all the nested if/elses? [17:36:13] it's hierarchical [17:36:24] worst case would be the 217 range in the beeline acl [17:36:55] so if we didn't have ANY acls in that config file, how much faster will the varnish file work? [17:36:59] it has to go until the middle in the first-level comparisons and then do all the second-level comparisons [17:37:06] that's again for a single ACL [17:37:31] if you want to test against all of them, bailing out on first match, then unmatched would probably be worst case [17:37:57] anyway, as I said, maybe this should be a mailing list thread? [17:38:07] I'm sure both mark and binasher would be interested [17:38:11] ok, will email :) [17:38:22] and probably more ops people [17:38:24] please reply with your perf data [17:38:31] Ryan_Lane: labstore1 - RAID, partially degraded, creating ticket to have its disk replaced [17:38:55] paravoid, how complex would it be to implement an ACL switch statement? [17:39:00] what do you mean? 
[17:39:13] a new VCL language construct [17:39:34] switch (req.ip) { case ACL1: code; break; case ACL2: ... [17:40:05] better question would be - is this worth an investment to begin with :) [17:40:13] er? [17:40:17] I don't understand [17:40:47] well, the current bottleneck as i see it is the linear algo used by varnish to check against ACL lists [17:41:13] !log authdns-update, pdns on ns1 croaked, restarted [17:41:16] instead, varnish script compiler should generate an ordered list of IPs [17:41:19] Logged the message, Master [17:41:26] and some code to do binary search in that list [17:41:40] this instantly shrinks the ACL search to log N [17:41:48] ?? [17:41:58] and makes VCL code much more consise [17:42:23] could you give an example of what you're proposing? [17:42:40] a switch statement vs. if/elses doesn't make any difference algorithm-wise [17:42:42] basically, instead of writing if (req.ip ~ acllist1) { set ... } else if (req.ip ~ acllist2 ) { set ... } [17:42:52] mutante: thanks [17:42:58] sure it does if the search is optimized internally [17:43:07] the generated code would be : [17:43:50] switch( searchIpInAllACLs(req.ip) ) { case 1: set ...; break; case 2: set ...; } [17:44:11] and the searchIp is autogenerated code to binary search in a list of all ip blocks [17:44:54] yurik: buy the ticket, take the ride [17:44:59] yurik: i mean, write the code, submit the patch [17:45:37] binasher, i would love to, but suspect it will be a fairly complex undertaking, so would like to get a good grasp of how much this is needed first [17:45:51] that would be more optimal algorithmically indeed [17:46:01] i will spell it out in my email [17:46:23] it wouldn't be that complex btw [17:46:28] you can write inline C in VCL [17:47:00] you don't have to change the VCL syntax, just write a C function that gets an ip and returns the carrier string [17:47:02] indeed, it doesn't have to be a varnish patch at all [17:47:09] inline C or vmod [17:47:38] and then do patricia lookups or whatever [17:47:46] paravoid, one of my goals is to make the script autogenerated based on the user settings [17:47:57] which user settings? [17:48:01] ip blocks [17:48:09] you can make your code read a file if that's what you want [17:48:16] as long as you don't make it to read it on every request [17:48:32] New patchset: RobH; "removal of home mount dependent entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56185 [17:48:40] or we'll have a stroke or something [17:48:47] hehe [17:49:04] that might be a good idea - does varnish have any kind of static state store? [17:49:19] or should i just declare a global static var :) [17:49:25] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56185 [17:49:27] refreshing it would be a pain [17:50:05] all that is of course orthogonal to communicating to telcos that we want ip blocks and not just random /31s or /32s [17:50:27] paravoid, that's up to the biz ppl really, they are the ones dealing with the telcos [17:50:33] and applying common sense by aggregating those blocks [17:50:34] ^demon: did hashar fix the zuul bug, gerrit is checking things fast today. [17:50:35] ? [17:50:38] (its awesome) [17:50:45] <^demon> Yeah, I think he cherry picked something in. 
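
(Below is a toy version of the approach floated above -- "write a C function that gets an ip and returns the carrier string" backed by a sorted table and a binary search, so the per-request cost is O(log N) instead of one linear ACL walk per carrier. The ranges, carrier names and function names are all invented, and real use would live in a vmod or an inline-C block in the VCL rather than a standalone program; IPv4 only, for brevity.)

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>

    struct ip_range {
        uint32_t    start;    /* first address in the block, host byte order */
        uint32_t    end;      /* last address in the block */
        const char *carrier;  /* tag handed back to VCL, e.g. into a header */
    };

    /* Sorted by .start and non-overlapping; in practice this table would be
     * generated from the per-carrier IP blocks, not written by hand. */
    static const struct ip_range ranges[] = {
        { 0x0A000000u, 0x0AFFFFFFu, "example-carrier-a" },  /* 10.0.0.0/8     */
        { 0xC0A80000u, 0xC0A8FFFFu, "example-carrier-b" },  /* 192.168.0.0/16 */
    };
    static const size_t nranges = sizeof(ranges) / sizeof(ranges[0]);

    /* Binary search: returns the carrier tag for ip, or NULL if nothing matches. */
    static const char *lookup_carrier(uint32_t ip)
    {
        size_t lo = 0, hi = nranges;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (ip < ranges[mid].start)
                hi = mid;
            else if (ip > ranges[mid].end)
                lo = mid + 1;
            else
                return ranges[mid].carrier;
        }
        return NULL;
    }

    int main(void)
    {
        struct in_addr a;
        if (inet_pton(AF_INET, "192.168.3.4", &a) != 1)
            return 1;
        const char *c = lookup_carrier(ntohl(a.s_addr));
        printf("%s\n", c ? c : "no match");
        return 0;
    }
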
[17:50:46] this isn't a "business" requirement, this is a technical requirement [17:50:49] He backported a fix, I think [17:50:59] it ran the check and updated my changeset before i could click initial link [17:51:00] really, we should just say no when < /24s are requested [17:51:08] * RobH is amazingly happy about this [17:51:13] \o/ [17:51:14] paravoid, i already told them, so lets hope it will be that way :) [17:51:28] yeah, I don't think anyone's going to come back [17:51:44] maybe when we -1 the next patch submitted 24h before a launch :) [17:51:57] heh [17:52:05] wondered why they submitted some IPs with a /32 and some without any netmask at all [17:52:15] i guess noone thought adding those ACLs would be a problem [17:52:46] even if you ignore the performance benefits, this was just loads of junk [17:52:49] in our configs [17:53:01] like .6.0/24 and then .6.5, .6.6, .6.7 etc. [17:53:19] (paravoid, i can limn talk real talk whenever you can? :) ) [17:53:38] mutante, they just wanted to be difficult... or, more likely, they have some internal file they keep for all their mobile data clients, and they didn't bother cleaning it up [17:53:46] ottomata: yeah, sorry about that... [17:53:54] no worries [17:53:58] ottomata: I promise I'll give a meaningful hopefully final review today [17:54:09] ok cool, the biggest open question is the upstart thing [17:54:12] New review: Alex Monk; "Depends on abandoned patchset." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56130 [17:54:17] if you've got advice on that right now I can work on it before you final review [17:55:32] New review: Ottomata; "Ok with me, but I'm still learning git buildpackage best practices." [operations/debs/python-voluptuous] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/56168 [18:00:59] New review: Ottomata; "Did I totally miss the setup.py stuff Faidon was complaining about? I don't see that at all." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54324 [18:05:26] !log mw1130 - kill and fix puppet runs [18:05:32] Logged the message, Master [18:06:27] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [18:08:33] !log kaldari synchronized php-1.21wmf12/extensions/Echo 'syncing Echo for wmf12' [18:08:36] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [18:08:40] Logged the message, Master [18:08:49] !log bunch of mw* hosts: fix puppet runs [18:08:56] Logged the message, Master [18:11:06] PROBLEM - Puppet freshness on mw1095 is CRITICAL: Puppet has not run in the last 10 hours [18:15:05] New review: Ottomata; "The *-labs-proxy files are still in the module. Generic modules like this shouldn't have to be modi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [18:19:53] New review: Mattflaschen; "It works, so now I just need a review." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [18:22:54] !log installing package upgrades on neon (Icinga), incl. ganglia-monitor,gmeta [18:23:01] Logged the message, Master [18:24:06] PROBLEM - Puppet freshness on mw62 is CRITICAL: Puppet has not run in the last 10 hours [18:24:38] New review: Odder; "Scratch that last one, I must have been fast asleep (at 4:42) when I was writing that comment; $wgAu..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56055 [18:25:06] PROBLEM - Puppet freshness on mw1135 is CRITICAL: Puppet has not run in the last 10 hours [18:27:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf12 [18:27:57] Logged the message, Master [18:30:35] wee [18:31:15] New patchset: Reedy; "Everything else to 1.21wmf12" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56188 [18:32:39] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56188 [18:36:17] PROBLEM - Puppet freshness on mw51 is CRITICAL: Puppet has not run in the last 10 hours [18:36:36] mutante: you killed off db11 last night right? [18:39:30] !log reedy synchronized php-1.21wmf12/extensions/Wikibase [18:39:37] Logged the message, Master [18:40:04] cmjohnson1: yea, added to decom.pp [18:40:18] cmjohnson1: but for some reason not removed from monitoring .. i saw ..hrmm [18:40:47] okay..icinga is still reporting puppet....i wonder if icinga doesn't pull from decom.pp to remove from monitoring [18:40:55] for which one? [18:40:59] New patchset: Reedy; "Enable Wikidata data transclusion for some clients" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56169 [18:41:01] which host I mean [18:41:05] db11 [18:41:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56169 [18:41:43] so, what needs to be run to be removed from icinga is [18:41:46] cmjohnson1: it should, it does things like nagios used to [18:41:56] are we sure puppets run on neon? [18:42:08] a puppet run on stafford or sockpuppet, then a cronjob that runs once every few hours or so to remove it from exported resources [18:42:17] then a puppet run on neon to remove it from nagios [18:42:22] this can take quite a while [18:42:38] !log reedy synchronized wmf-config/InitialiseSettings.php [18:42:44] if its a known decom, you can login to icinga as admin, and acknowledge it [18:42:45] Logged the message, Master [18:42:49] so its not on main screen until neon catches up. [18:42:51] cmjohnson1: nevertheless, i disabled all notifications for it [18:43:10] nutante: how? [18:43:15] mutante ^ [18:43:32] oh, you know, it is already gone, kind of, see here: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=db11 [18:43:43] now it's probably neon's problem with snmptt [18:44:27] cmjohnson1: by clicking in icinga web ui, go to a host page, and "Disable notifications for all services on this host" [18:44:48] ok..simple enough [18:44:51] thx [18:44:51] you need to be logged in, using icinga-admin.wm [18:44:54] np [18:45:08] cmjohnson1: https://icinga-admin.wikimedia.org/icinga/ [18:45:16] uses your gerrit login details (as they are ldap based) [18:48:48] New patchset: RobH; "removing home dependent maint script entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56190 [18:49:14] paravoid: i keep seeing "flush-252:2" in ps of neon, isn't seeing flush processes use CPU a lot -> kernel bug? i mean it's not like it's using a ton, but it keeps popping up [18:50:12] also, upgraded gmetad and ganglia-monitor, there were package upgrades avail. 
[18:50:31] at least ganglia-monitor is a wmf package [18:50:38] eh, both are [18:50:43] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56190 [18:50:54] !log pushing mw1209-1220 into service [18:51:00] Logged the message, Master [18:52:17] RECOVERY - Puppet freshness on mw1001 is OK: puppet ran at Wed Mar 27 18:52:13 UTC 2013 [18:52:17] RECOVERY - Puppet freshness on mw1095 is OK: puppet ran at Wed Mar 27 18:52:14 UTC 2013 [18:52:17] RECOVERY - Puppet freshness on mw1025 is OK: puppet ran at Wed Mar 27 18:52:14 UTC 2013 [18:52:27] RECOVERY - Puppet freshness on mw51 is OK: puppet ran at Wed Mar 27 18:52:17 UTC 2013 [18:52:27] RECOVERY - Puppet freshness on mw62 is OK: puppet ran at Wed Mar 27 18:52:18 UTC 2013 [18:52:27] RECOVERY - Puppet freshness on mw1120 is OK: puppet ran at Wed Mar 27 18:52:19 UTC 2013 [18:52:27] RECOVERY - Puppet freshness on mw1099 is OK: puppet ran at Wed Mar 27 18:52:19 UTC 2013 [18:52:27] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Wed Mar 27 18:52:23 UTC 2013 [18:53:35] <-- i believe these are back because i just killed some stuck snmptt proc [18:53:38] bbl [18:53:40] RECOVERY - Puppet freshness on mw1130 is OK: puppet ran at Wed Mar 27 18:53:32 UTC 2013 [18:53:40] RECOVERY - Puppet freshness on mw1135 is OK: puppet ran at Wed Mar 27 18:53:34 UTC 2013 [18:56:41] New patchset: Demon; "Don't use home for maintenance scripts where possible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56192 [18:58:58] New review: Demon; "misc::maintenance::update_special_pages and misc::maintenance::refreshlinks still need some love to ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56192 [19:05:18] !log reedy Started syncing Wikimedia installation... : Rebuild message cache for WikiData deployment [19:05:26] Logged the message, Master [19:08:35] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:10:45] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [19:18:30] Ugh [19:18:41] Scap just gets noisier and noisier... [19:31:11] New review: Siebrand; "This fixes bug 46603." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56179 [19:31:52] New review: Aaron Schulz; "I wasn't doing this for that bug, it was just on my TODO list and there happened to be a bug for it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56179 [19:37:15] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [19:37:15] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [19:37:15] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [19:37:51] !log reedy Finished syncing Wikimedia installation... : Rebuild message cache for WikiData deployment [19:37:58] Logged the message, Master [19:42:46] New review: Ori.livneh; "@Ottomata: Yeah -- I fixed it in PS2." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324 [19:43:12] !log reedy synchronized php-1.21wmf12/extensions/ParserFunctions/ [19:43:21] Logged the message, Master [19:44:05] New review: Ottomata; "Ah, I see it, I thought I was looking at patchset 1 (gerrit browsing weirdness)." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324 [19:51:08] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56192 [19:51:23] RECOVERY - RAID on labstore1 is OK: OK: State is Optimal, checked 12 logical device(s) [19:52:56] New patchset: Odder; "(bug 46182) Set LQT as opt-out on se.wikimedia (chapter wiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130 [19:54:26] New patchset: RobH; "adding back fixed entries to terbium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56201 [19:55:31] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56201 [19:56:01] !log reedy Started syncing Wikimedia installation... : Rebuild localisation cache to fix he related ParserFunctions problems [19:56:09] Logged the message, Master [19:56:36] mutante or sbernardin did either one of you do something w/ disk 9 on labstore1? [19:57:03] No [19:58:07] mutante fixed some raid montioring [19:58:10] so it may have been that [19:58:16] i asked him when i saw it scroll past [19:58:20] ok...cuz mutante had a ticket for degraded raid but when I checked disk 9 was in rebuild [19:58:21] he didnt actually touch labstores [19:58:30] interesting [19:58:54] but it finished and all is good now...but not sure why it would rebuild on its own [19:59:16] so you can resolve ticket, but note what happened, that it rebuilt itself [19:59:23] yep [20:04:13] New patchset: Odder; "(bug 46182) Set LQT as opt-out on se.wikimedia (chapter wiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130 [20:08:17] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:08:47] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours [20:10:24] !log reedy Finished syncing Wikimedia installation... : Rebuild localisation cache to fix he related ParserFunctions problems [20:10:27] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [20:10:30] Logged the message, Master [20:13:13] New patchset: Odder; "(bug 46182) Set LQT as opt-out on se.wikimedia (chapter wiki)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56130 [20:13:48] That was quick ;) [20:13:54] 14 minutes! [20:17:28] cmjohnson1: i don't know how that fixed .. ugh too late [20:20:47] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [20:45:58] New patchset: RobH; "terbium tweaking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56247 [20:46:54] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56247 [21:03:08] New patchset: Ottomata; "Move geoip to a module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [21:05:51] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [21:07:48] ottomata: thank you for taking care of the geoip module :-] [21:08:01] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [21:08:10] yeah, i'm having weird commit problems [21:08:13] trying to fix [21:08:17] jenkins says it can't merge and I have to rebase [21:08:18] uhhhh [21:08:29] actually, since youer' here [21:08:33] i have a conflict on contint.pp [21:08:40] ottomata: most probably because one of the old files has been changed meanwhile [21:08:43] oh [21:08:49] did you delete this class? 
[21:08:50] misc::contint::analytics::packages [21:09:10] I guess you should accept whatever comes from HEAD [21:09:13] yeah [21:09:15] plus my change, ok [21:09:17] just checking [21:09:31] oh cool, yeah [21:09:32] ok [21:09:33] i see [21:09:39] we don't need to include geoip on contint anymore? [21:09:45] we do iirc [21:09:47] since you deleted misc::contint::analytics::packages [21:09:57] production HEAD isn't doing it anymore [21:09:58] it is in its own module now [21:10:05] right but contint.pp isn't including it [21:10:05] New patchset: Asher; "changing s1 slave weights for testing" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56313 [21:10:08] anywhere [21:10:10] even in production head [21:10:14] so the include should be something like contint::analytics::packages [21:10:16] it hasn't been moved to a module there yet [21:10:20] ah [21:10:38] unless you moved it to a different file [21:10:53] nothing matching 'analytics::packages' in the whole project [21:10:54] so :) [21:10:57] guess its ok? [21:11:02] I guess :-] [21:11:13] it is not going to uninstall the packages anyway [21:11:29] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56313 [21:11:40] ja [21:12:08] New patchset: Ottomata; "Move geoip to a module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [21:12:22] !log asher synchronized wmf-config/db-eqiad.php 'changing s1 slave weights for testing' [21:12:22] New review: Odder; "Community approval for change gathered at https://tr.wikipedia.org/w/index.php?oldid=12797562#Teklif" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56101 [21:12:29] Logged the message, Master [21:14:29] ^demon: see the mangled link in that last comment from gerrit-wm [21:14:32] :( [21:14:53] New review: Ottomata; "Ok! Lots of changes since the last patchset." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [21:15:41] ottomata: also the geoip module should install the ubuntu database when in labs. We do not have access to the private geoip.dat [21:15:50] ottomata: I think faidon gave a hint about it on code review [21:17:15] oh right hmmmmm [21:17:21] i can do that! [21:17:24] :) [21:18:11] hey guys, who's best to ask about nginx proxy stuff? [21:18:18] analytics people are curious about how nginx proxies to the mobile site [21:18:28] you mean HTTPS? [21:18:48] (ottomata) [21:18:49] ottomata: you rocks :-] [21:19:00] ahh nginx [21:19:19] I looked at the prod manifest. Mobile would like the beta cluster to support HTTPS [21:19:37] which mean we need to abstract out the nginx conf to fit with another context [21:19:43] I guess that would apply for analytics too [21:20:35] jeremyb_, yes, and IpV6, but yeah HTTPS [21:20:47] well, they're asking because I thikn they are looking for some data [21:20:51] ottomata: so what's the question? :) [21:21:00] and are wondering if they are missing some because the nginx requests might not go to cp1041-1044 [21:21:16] we are only collecting data from those hosts, because we assume that they are the only frontend cache hosts serving mobile data [21:21:22] mobile sites* [21:21:32] but, if nginx happens to proxy elsewhere, then they would be missing some data [21:21:35] for HTTPS [21:21:51] can't we have the nginx frontend send them the logs ? [21:22:29] 8: [21:22:34] heheheh [21:22:37] well [21:22:37] 1. [21:22:42] that would duplicate the requests in the logs [21:22:59] 2. 
nginx sends bad sequence numbers, which messes with packet loss monitoring [21:23:02] but! [21:23:22] they need to know IF the logs are not being collected before we change our behavior [21:23:34] if nginx does in fact proxy there, then no probs. [21:23:44] mobilefrontend doesn't embed a backend name in page source? [21:23:52] :( [21:25:52] well empirically i'm using 1041,1042,1044 for the few reqs i just made [21:30:13] New review: Hashar; "The url is wrong, the trailing slash need to be removed. Amending." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [21:31:28] New patchset: Hashar; "monitoring lucene search boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [21:32:27] ottomata: what does pybal say? [21:32:43] how do I ask it? [21:32:44] :p [21:32:48] idk [21:32:51] hah [21:33:16] New review: Dzahn; "thanks, this is better:" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/51174 [21:33:38] HTTP CRITICAL: HTTP/1.1 200 OK - pattern found [21:33:38] New review: Hashar; "PS4 removes the trailing slash in the url '/status/' -> '/status' . The commit message had the corre..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [21:33:44] \O/ [21:33:45] scapping... [21:33:53] hashar: eh.. it's CRIT and OK :) [21:34:07] mutante: also some search boxes are wrong right now. I guess some indices have been moved from some boxes to others. [21:34:18] mutante: xyzram is aware of it since he posted on ops-l [21:34:47] mutante: and leslie confirmed that a CRITICAL check will issue a page. So I guess we want to hold this change a bit until all search boxes are fully operational (aka no more reporting FAILED indices)`. [21:34:59] not all CRITICALs [21:35:13] ah [21:35:16] you can check the nagios log to see if it really paged [21:35:22] if we flag it properly :) [21:35:24] hashar: only if you would add critical => critical in the monitor_service [21:35:25] well if we manage to make it not page, that would be nice [21:35:27] or the screams of all ops folks [21:35:31] but I have no idea how to do that [21:35:46] it should not page as it is now [21:35:56] but i don't see why it would be HTTP OK and CRIT at the same time [21:35:59] with exit code 2 [21:36:04] icinga is soooooo slwo [21:36:06] slow* [21:36:29] yeah it need to be tweaked [21:36:31] ;-D [21:36:43] there are too many checks and we hit the number of checks limit [21:36:46] hashar: it's the --invert-regex thing somehow [21:37:04] New patchset: Ottomata; "Move geoip to a module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [21:37:06] no, it's gmetad and ganglia_parser [21:37:08] well the biggest problem right now is honestly ganglios since we need to run gmetad on there [21:37:22] if we can move ganglios to page from nickel (the ganglia machine) we will be much better [21:37:29] if you want to check that out hashar ? ;) [21:37:48] hashar: with just --regex FAILED: [21:37:48] hashar, check that one out! if realm != production, install geoip-database :) [21:37:59] HTTP OK: HTTP/1.1 200 OK [21:38:10] with --invert-regex --regex FAILED [21:38:24] HTTP CRITICAL: HTTP/1.1 200 OK [21:39:16] yeah that is what you want [21:39:21] FAILED -> cirtical [21:39:23] which server? [21:39:40] search1019 [21:39:50] New patchset: Ottomata; "Move geoip to a module." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [21:40:09] curl http://search1019.eqiad.wmnet:8123/status [21:40:18] that gives me some zhwiki*.spell indices being in failure [21:40:24] so that should give a CRITICAL error [21:40:48] The program 'curl' is currently not installed. :/ [21:40:52] ah [21:40:55] use fenari :-] [21:41:20] i don't know what you want to achieve though.. [21:41:36] ok [21:41:58] fenari does not have the nagios plugin.. but yeah [21:42:37] so if that is supposed to be CRIT, but still tell you that HTTP itself is 200, then it seems fine [21:42:49] using --invert-regex [21:44:53] New review: Dzahn; "on search1019 with known brokenness:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [21:44:54] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174 [21:45:17] root@neon:~# echo $? 2 [21:45:17] jeremyb_, i just checked too (not sure if you already told me this), I see [21:45:17] X-Varnish: 1252984284 [21:45:17] Age: 0 [21:45:17] Via: 1.1 varnish [21:45:17] X-Cache: cp1041 frontend miss (0) [21:45:25] when I request https://m.wikipedia.org/ [21:45:29] so, that's good (i think!) [21:46:03] mutante: basically I want to warn whenever the lucene /status page contains FAILED :-] [21:46:12] ottomata: yeah... talk to ma rk or leslie or ryan [21:46:38] mutante: note that it might page you about it [21:46:59] I have no idea how to disable paging [21:47:27] not right now … :) [21:47:48] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment, 2nd attempt [21:47:55] Logged the message, Master [21:48:18] hashar: it will not page me [21:48:22] (no worries Leslie, I'm pretty sure I understand it) [21:48:38] hashar: you dont have to disable it, you have to enable it [21:48:44] mutante: ahh [21:48:46] thanks :) [21:48:58] monitor_service { 'lucene_search': description => 'Lucene search', check_command => "check_luc [21:49:01] ene_frontend" } [21:49:02] not paging [21:49:08] mutante: make sure to ping notpeter IRL about the new search monitoring [21:49:49] monitor_service { 'lucene_search': description => 'Lucene search', check_command => "check_lucene_frontend", critical => true } [21:49:55] <-- that WOULD be paging, hashar [21:50:16] nagios.pp:define monitor_service ... $critical="false" [21:50:19] <- default is false [21:51:44] \O/ [21:52:15] done in not-RL :p [21:54:58] * jeremyb_ senses that dr0ptp4kt just figured out he was in the wrong place [21:55:43] * jeremyb_ senses that dr0ptp4kt is going to be one of those people I have a lot of trouble consistently associating with the right other names. e.g. abaso [21:58:03] jeremyb: :) [22:04:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [22:04:58] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment, 2nd attempt [22:05:05] Logged the message, Master [22:07:03] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [22:07:33] PROBLEM - Lucene search on search1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58473 bytes in 0.027 second response time [22:07:43] PROBLEM - Lucene search on search1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55565 bytes in 0.011 second response time [22:09:24] that's not so useful :( [22:09:46] what's the problem now [22:10:04] the only relevant thing it says is CRITICAL [22:10:04] i just asked a couple times.. 
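The mobile HTTPS question earlier in this stretch is settled empirically: request the mobile site over HTTPS and read the X-Cache header, which names the answering frontend (the log shows "X-Cache: cp1041 frontend miss (0)"). A small probe in that spirit is sketched below; the URL and the sample size are illustrative choices, not taken from the log.

```python
#!/usr/bin/env python3
"""Sketch: tally which varnish frontends answer mobile HTTPS requests.

The X-Cache response header names the frontend cache (the log above shows
"X-Cache: cp1041 frontend miss (0)"). URL and sample size are arbitrary.
"""
from collections import Counter
from urllib.request import urlopen

URL = "https://en.m.wikipedia.org/"   # assumption: any mobile page will do
SAMPLES = 20                          # arbitrary

seen = Counter()
for _ in range(SAMPLES):
    with urlopen(URL, timeout=10) as resp:
        xcache = resp.headers.get("X-Cache", "")
        # the first token is the cache host, e.g. "cp1041"
        seen[xcache.split()[0] if xcache else "unknown"] += 1

for host, count in seen.most_common():
    print("%-10s %d" % (host, count))
```

If every sample reports one of cp1041-cp1044, the analytics assumption that those are the only frontends serving mobile traffic holds for HTTPS as well.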
[22:10:17] it doesn't say anything about *what* is broken [22:10:40] 14:55 < hashar> mutante: basically I want to warn whenever the lucene /status page contains FAILED :-] [22:10:55] right [22:11:03] * jeremyb_ may tweak it later [22:11:33] 14:43 < hashar> mutante: xyzram is aware of it since he posted on ops-l [22:11:37] 14:44 < jeremyb_> you can check the nagios log to see if it really paged [22:12:48] jeremyb_: yeah I know, but that is better than nothing [22:12:54] ideally we should write our own plugin [22:14:13] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55228 [22:14:29] the 2.5 people who could fix it, already know what it means , right [22:16:25] yup [22:16:30] I guess it is good enough for now [22:16:43] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours [22:16:44] jeremyb_: feel free to write a nagios plugin :-] [22:17:18] hashar: i feel exactly that way :) [22:19:36] New patchset: awjrichards; "Make Special:LoginHandshake used everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56327 [22:20:21] New patchset: Dzahn; "be more specific about tell *what* is broken in new search monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56328 [22:20:28] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56327 [22:22:07] !log maxsem synchronized wmf-config/InitialiseSettings.php [22:22:14] Logged the message, Master [22:23:55] !log maxsem synchronized wmf-config/InitialiseSettings.php [22:24:03] Logged the message, Master [22:30:21] New patchset: Dzahn; "be more specific about *what* is broken in new search monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56328 [22:30:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56328 [22:32:00] :) [22:32:04] I am out to bed *wave* [22:32:17] bye [22:37:10] PROBLEM - Lucene search on search1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.030 second response time [22:37:29] PROBLEM - Lucene search on search1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58473 bytes in 0.019 second response time [22:37:29] PROBLEM - Lucene search on search16 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.112 second response time [22:37:39] PROBLEM - Lucene search on search1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.015 second response time [22:37:39] PROBLEM - Lucene search on search15 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.110 second response time [22:47:13] !log maxsem synchronized php-1.21wmf12/extensions/MobileFrontend 'touch' [22:47:20] Logged the message, Master [22:49:46] !log maxsem synchronized php-1.21wmf11/extensions/MobileFrontend 'touch' [22:49:53] Logged the message, Master [22:50:56] New patchset: Yurik; "Added Vimpelcom Mobilink Pakistan ACLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [22:52:21] yurik__: most of pakistan reads english? [22:55:07] New patchset: Reedy; "Basic config for ukwikivoyage and hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56335 [22:55:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56335 [22:56:19] woooo, Reedy's making wikivoyages! 
[22:57:34] :) [23:00:36] New review: Yurik; "Need to revert landing page for single-language telcos" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55302 [23:01:10] !log reedy synchronized wmf-config/InitialiseSettings.php [23:01:17] Logged the message, Master [23:02:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:02:29] Logged the message, Master [23:03:19] New patchset: Reedy; "dblists and wikiversions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56338 [23:03:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56338 [23:04:18] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [23:05:58] PROBLEM - search indices - check lucene status page on search1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.016 second response time [23:06:18] PROBLEM - search indices - check lucene status page on search1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58473 bytes in 0.010 second response time [23:06:28] PROBLEM - search indices - check lucene status page on search1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55565 bytes in 0.012 second response time [23:06:28] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [23:06:38] PROBLEM - search indices - check lucene status page on search1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58473 bytes in 0.011 second response time [23:06:38] PROBLEM - search indices - check lucene status page on search16 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.110 second response time [23:10:48] !log reedy synchronized wmf-config/InitialiseSettings.php [23:10:55] Logged the message, Master [23:11:52] I hope those lucene alerts didn't page anyone [23:14:32] TimStarling: they did not, they dont have critical => true [23:15:09] yurik: manifests/site.pp line 273 [23:15:11] they are detecting string FAILED on lucene status pages [23:16:03] did not :) woot [23:16:10] except i love the paging on 200 OK [23:17:05] it is technically correct, it tells you it gets a 200 BUT found a string FAILED inside the page [23:17:11] !log reedy synchronized php-1.21wmf12/cache/interwiki.cdb 'Updating 1.21wmf12 interwiki cache' [23:17:12] which in turn tells us.. to check index creation [23:17:17] Logged the message, Master [23:18:05] it's this: https://gerrit.wikimedia.org/r/#/c/51174/4 [23:20:10] thanks LeslieCarr !! 
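The "HTTP CRITICAL: HTTP/1.1 200 OK - pattern found" output is check_http doing exactly what change 51174 asked of it: the page loads fine (200) but matches the inverted FAILED regex, so the state is CRITICAL without saying which index is broken. The dedicated plugin hashar and jeremyb_ mention would go one step further and name the failing indices. A minimal sketch of that idea, assuming only what the log states about the status page (port 8123, a /status page whose broken indices contain the string FAILED) and the usual Nagios exit codes:

```python
#!/usr/bin/env python3
"""Sketch of a lucene-search status check that names the broken indices.

Assumptions: the status page lives at http://HOST:8123/status and lines for
broken indices contain the string "FAILED" (that is all the log tells us).
Exit codes follow the Nagios convention: 0=OK, 2=CRITICAL, 3=UNKNOWN.
"""
import sys
from urllib.request import urlopen
from urllib.error import URLError

def main():
    host = sys.argv[1] if len(sys.argv) > 1 else "search1019.eqiad.wmnet"
    url = "http://%s:8123/status" % host
    try:
        with urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", "replace")
    except URLError as e:
        print("UNKNOWN: could not fetch %s: %s" % (url, e))
        return 3

    failed = [line.strip() for line in body.splitlines() if "FAILED" in line]
    if failed:
        print("CRITICAL: %d failed indices: %s"
              % (len(failed), "; ".join(failed[:5])))
        return 2
    print("OK: no FAILED indices on %s" % host)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Hooked up through the existing monitor_service definition (still without critical => true, as discussed above), a check like this would keep the non-paging behaviour but put the failing index names straight into the alert text.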
[23:22:50] !log upgrading wikitech to wmf/1.21wmf12 [23:22:52] !log reedy synchronized wmf-config/InitialiseSettings.php [23:22:56] Logged the message, Master [23:23:02] Logged the message, Master [23:24:18] !log finished upgrading wikitech [23:24:24] TimStarling: Numerous of the revisions we added to newdeploy will need merging into master of mediawiki-config [23:24:24] Logged the message, Master [23:24:32] due to all the hard coded paths and such [23:28:27] New patchset: Reedy; "More configuration for new wikivoyages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56339 [23:28:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56339 [23:38:38] !log updating wikitech-static to wmf/1.21wmf12 [23:38:44] !log reedy synchronized wmf-config/InitialiseSettings.php [23:38:45] Logged the message, Master [23:38:51] Logged the message, Master [23:40:21] !log finished updating wikitech-static [23:40:28] Logged the message, Master [23:41:34] PROBLEM - search indices - check lucene status page on search1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.024 second response time [23:41:54] PROBLEM - search indices - check lucene status page on search15 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 61432 bytes in 0.110 second response time [23:44:00] !log reedy synchronized wmf-config/InitialiseSettings.php [23:44:07] Logged the message, Master [23:46:19] !log reedy synchronized wmf-config/InitialiseSettings.php [23:46:26] Logged the message, Master [23:48:40] New patchset: Reedy; "Rest of config for ukwikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56341 [23:49:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56341 [23:58:42] !log reedy synchronized php-1.21wmf12/extensions/Wikibase [23:58:48] Logged the message, Master
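After a round of "rebuilt wikiversions.cdb" and sync runs like the ones above, the deployed branch can also be confirmed from the outside: the MediaWiki API's meta=siteinfo query reports the running version in its generator field (for example "MediaWiki 1.21wmf12"). A small sketch follows; the list of wikis is just an example.

```python
#!/usr/bin/env python3
"""Sketch: confirm which MediaWiki branch a few wikis are actually serving.

meta=siteinfo returns a "generator" field such as "MediaWiki 1.21wmf12".
The wikis listed below are arbitrary examples.
"""
import json
from urllib.request import urlopen

WIKIS = [
    "en.wikipedia.org",
    "uk.wikivoyage.org",   # created in the log above
    "he.wikivoyage.org",
]

for wiki in WIKIS:
    url = ("https://%s/w/api.php?action=query&meta=siteinfo"
           "&siprop=general&format=json" % wiki)
    with urlopen(url, timeout=10) as resp:
        general = json.loads(resp.read().decode("utf-8"))["query"]["general"]
    print("%-20s %s" % (wiki, general.get("generator", "unknown")))
```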