[00:08:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.922 seconds [00:21:14] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:21:14] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:55:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:56:11] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [01:09:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.277 seconds [01:41:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 257 seconds [01:44:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:46:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [01:57:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.639 seconds [02:21:14] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [02:26:25] !log LocalisationUpdate completed (1.21wmf2) at Thu Oct 18 02:26:24 UTC 2012 [02:26:44] Logged the message, Master [02:31:51] New review: Dzahn; "chmod 644 files in noc docroot per hashar" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/28334 [02:31:51] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28334 [02:32:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.235 seconds [02:47:22] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with 
command name varnishncsa [02:50:46] !log LocalisationUpdate completed (1.21wmf1) at Thu Oct 18 02:50:46 UTC 2012 [02:50:59] Logged the message, Master [03:09:14] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [03:15:14] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [03:18:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.826 seconds [03:58:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:12:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.512 seconds [04:36:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [04:47:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [05:35:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:37:17] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [05:49:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [05:58:17] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:58:17] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:23:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.061 seconds [07:09:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [07:25:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [07:56:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:10:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.181 seconds [08:44:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:49:15] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [08:57:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.962 seconds [09:32:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:04] New patchset: Tim Starling; "Initial import of librsvg from Lucid, package version 2.26.3-0ubuntu1.1. Used librsvg instead of librsvg2 as a directory name, to match the source package name in Ubuntu after it was changed in 2.12.7-1" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28493 [09:43:04] New patchset: Tim Starling; "* Refreshed and re-added wikimedia-brand.patch, based on the hardy one * Wrote a new security patch, aimed at upstream compatibility. 
Instead of external file references simply being patched out, a new command line option is added to rsvg-convert allowing" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28494 [09:43:05] New patchset: Tim Starling; "Updated to librsvg_2.36.1-0ubuntu1 (precise)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28495 [09:43:05] New patchset: Tim Starling; "Re-added security patch" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28496 [09:43:46] Change merged: Tim Starling; [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28493 [09:43:57] Change merged: Tim Starling; [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28494 [09:44:01] Change merged: Tim Starling; [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28495 [09:44:06] Change merged: Tim Starling; [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/28496 [09:46:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.762 seconds [10:15:27] New patchset: Hashar; "beta: autoupdater now report full output on error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28497 [10:16:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28497 [10:18:54] New patchset: Hashar; "beta: autoupdater now reports full output on error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28497 [10:19:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:20:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28497 [10:22:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:22:15] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:34:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [10:50:48] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [10:57:14] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [11:02:47] mark: apergos: paravoid: if anyone of you is connected, could you restart memcached on virt0 ? It died a few minutes ago. [11:02:58] I am wondering if it got killed by the OOM catcher [11:08:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:20:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.718 seconds [11:21:52] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.000 second response time on port 11000 [11:22:29] ah it is back :) [11:55:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:06:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.421 seconds [12:12:30] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:21:48] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [12:22:15] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:23:15] New review: Dereckson; "CreditSource -> CreditsSource" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/28238 [12:40:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:27] New review: Hashar; "Dupe of Chris I929feecd" 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28238 [12:48:41] New review: Hashar; "Dupe of Matthais I7ecf369b" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/28375 [12:56:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [12:56:29] New patchset: Dereckson; "(bug 41167) Namespace configuration for ba.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28505 [13:10:17] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [13:16:15] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [13:18:11] New review: Hashar; "The reason I did that is that the wikibugs perl script need to be deployed at a specific version whi..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27175 [13:29:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [13:50:45] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:07] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.706 second response time [14:13:50] New patchset: Matthias Mullie; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [14:14:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:19] !log re-adding srv190 to rendering pool [14:19:33] Logged the message, notpeter [14:21:03] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:33] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39615 bytes in 1.088 seconds 
[14:23:08] wtf? [14:26:18] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:26:26] mark: about? [14:26:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.683 seconds [14:29:18] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.164 second response time [14:37:19] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:56:18] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [14:58:15] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on search1012 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on search1017 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on sq52 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on sq72 is CRITICAL: Puppet has not run in the last 10 hours [14:59:19] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [14:59:20] PROBLEM - Puppet freshness on sq77 is CRITICAL: Puppet has not run in the last 10 hours [14:59:20] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours [14:59:21] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [14:59:21] PROBLEM - Puppet freshness on virt8 is CRITICAL: Puppet has not run in the last 10 hours [14:59:22] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [14:59:22] PROBLEM - Puppet freshness on yttrium is CRITICAL: Puppet has not run in the last 10 hours 
[15:00:14] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on sq51 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on sq75 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on virt1008 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on sq84 is CRITICAL: Puppet has not run in the last 10 hours [15:00:14] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [15:01:15] PROBLEM - Puppet freshness on sq74 is CRITICAL: Puppet has not run in the last 10 hours [15:01:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:57] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:32] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: private wikis to 1.21wmf2 [15:07:44] Logged the message, Master [15:08:37] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 57804 bytes in 3.707 seconds [15:09:26] hrm? [15:09:49] strange [15:09:57] paravoid: yeah, was looking a bit strange earlier [15:10:10] load was rising a lot, but it seems to have leveled off.... [15:11:29] ah, very very large number of mc connection in TIME_WAIT [15:12:39] * Reedy kicks FF [15:14:34] paravoid: do you know what box is being used to test redis? 
[15:15:35] no [15:15:36] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews to 1.21wmf2 [15:15:37] '10.0.12.1', # mc1 [15:15:43] According to commonsettings [15:15:45] Reedy: ty [15:15:49] Logged the message, Master [15:15:52] ah there [15:17:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [15:17:23] searchidx2: rsync: link_stat "/wikiversions.cdb" (in common) failed: Stale NFS file handle (116) [15:17:23] searchidx2: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [generator=3.0.9] [15:17:23] notpeter: ^ [15:17:38] uuuuuuuuuhhhhhhhhhhh [15:17:38] huh [15:17:44] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [15:17:52] worked second time [15:17:52] hmmm [15:17:53] NFS sucks? [15:17:56] Logged the message, Master [15:18:08] that shouldn't have any nfs mounts.... [15:18:42] perhaps was nfs hiccup on fenari [15:19:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary and wikiversity to 1.21wmf2 [15:19:33] Logged the message, Master [15:19:33] what shouldn't? [15:19:34] I want memcache off of apaches so badly right now [15:19:34] searchidx? [15:19:36] yeah [15:19:47] yeah that makes two of us [15:19:57] paravoid: it used to use nfs back in the day, but has none now [15:20:01] hmm [15:20:06] (it = searchidx2) [15:20:15] Doesn't it use an nfs share to rsync from? [15:20:20] share/export [15:20:21] (like everything else) [15:20:37] Reedy: it rsyncs from /home [15:20:44] but it's not mounted [15:21:13] bleh [15:23:48] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:15] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.547 second response time [15:27:15] Reedy: how are things looking? 
[15:29:09] paravoid: I'm really tempted to just stop apache on all of the boxes that also have memcache on them [15:29:29] One of the lesser fatals is coming up on wikinews, so I've just stopped upgrading for a little while and looking at it [15:31:01] robla: just as an update, tim patched and deployed librsvg for precise last night [15:31:11] I have srv190 back into the rendering pool [15:31:15] notpeter: excellent! [15:31:18] I'm going to let it simmer for a bit [15:31:23] but I think that it'll be good to go [15:31:43] as it was doing just fine before (albeit with a non-optimal librsvg build) [15:32:45] and once it's declared "good", upgrading all the imagescalers should take a couple of hours at most [15:38:18] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [15:38:45] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:39:51] temp stopping apache on srv238 [15:40:06] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.265 second response time [15:43:42] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused [15:45:06] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:12] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [15:46:24] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [15:46:50] paravoid: I wonder if site performance would actually improve if we just turned off apache on all boxes that have memcache running... 
[15:47:16] our mw* boxes in pmtpa are very under-utilized [15:47:32] it'll be nice when we can bump the memory limits again [15:48:00] and it looks like our biggest bottleneck right now is actually fetching things from memcache (which seems to perform quite a bit worse on boxes that have apache running) [15:48:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:27] paravoid: or at least set the nice value of memcache to lower than that of apache [15:55:10] Reedy: there were a few issues marked as blockers here that didn't seem to block us: https://bugzilla.wikimedia.org/show_bug.cgi?id=38865 [15:56:32] It's useful to have those there for a reminder they need backporting etc when available [15:56:42] I just added 41178 as it should be dealt with before next monday [15:57:22] 41155 has a fix, but needs more review, merge and deployment. and 41122 Daniel and I are currently looking at [15:57:49] It'd be useful if we could find where this is being used, as it may be a reason to revert at the moment [15:57:56] 5/1000 isn't a lot, but still... [15:59:19] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:59:19] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:04:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.015 seconds [16:21:57] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:18] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 65774 bytes in 0.801 seconds [16:36:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:48] temp stopping apache on srv195 [16:47:18] PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused [16:47:34] robla: Do we want to do the rest of the set deployments? 
[16:47:46] Or just the rest of the set minus special (commons, meta..) [16:49:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [16:50:01] * robla ponders [16:50:18] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [16:50:51] so, private, wikinews, wiktionary and wikiversity + mw.o, test and test2 are all currently on it [16:51:13] 8 fatals in 1000 lines of apache logs, 2 warnings [16:52:04] Reedy: they all have bug tickets? [16:54:28] Reedy: let's hold off on commons and meta, and do those on Monday [16:54:46] * aude nods [16:57:37] aude: yup, I've logged bugs for all the ones we've seen [16:58:05] robla: so do the rest of today's deployment, minus commons and meta? [16:58:10] Reedy: thanks [17:01:34] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Rest of planned wikis to 1.21wmf2 minus commons and meta [17:01:43] :) [17:01:47] Logged the message, Master [17:02:03] I'll keep an eye on the logs and check it doesn't get much worse [17:10:49] New review: Asher; "I think puppet should have more flexibility around configuration options in solrconfig.xml in additi..." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/26571 [17:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:28] PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused [17:36:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [17:37:06] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.154 second response time [17:38:36] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:45] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.620 second response time [17:46:27] New patchset: Jgreen; "switching locke to log *all* banners, not just fundraising, per zexley request" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28537 [17:47:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28537 [17:47:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28537 [17:54:46] !log adjusted banner log filter to include non-fundraising banners [17:55:00] Logged the message, Master [17:57:03] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:57:30] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:57:30] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:57:41] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:57:48] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:57:57] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command 
name varnishncsa [17:57:57] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:58:06] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:58:15] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:58:33] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:58:42] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:08:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:09] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: Connection refused [18:14:34] New patchset: Dereckson; "(bug 41167) Namespace configuration for ba.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28505 [18:16:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 254 seconds [18:17:13] New patchset: Pyoungmeister; "adding apache scorecard stats to ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28544 [18:18:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28544 [18:20:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [18:22:06] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [18:22:33] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [18:22:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.102 seconds [18:22:51] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [18:22:51] Change abandoned: Asher; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3898 [18:23:00] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:12] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:12] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:18] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:27] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:36] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:55] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [18:24:03] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [18:33:39] New patchset: Pyoungmeister; "adding apache scorecard stats to ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28544 [18:34:51] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28544 [18:38:02] New patchset: CSteipp; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [18:40:32] New patchset: CSteipp; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [18:45:59] so, I want to push this out https://gerrit.wikimedia.org/r/#/c/28544/ [18:46:12] but I think that because our puppet runs all stack up and complete at the same time [18:46:25] it would end up dosin' [18:46:42] thoughts? [18:50:18] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [18:51:49] notpeter, i dunno, but do they complete at the same time? you mean adding the Apache module on all or the Ganglia side? [18:52:06] the apache module [18:52:18] it has a notify service apache attached to it [18:53:51] hmm, if they actually finished at the same time then it could be used for deployment? [18:53:59] !log reedy synchronized php-1.21wmf2/includes/ [18:54:08] Logged the message, Master [18:54:14] heh, well, not all the time ;) [18:54:16] but like, http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=stafford.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:54:33] it takes a really long time compiling some node's configs [18:54:36] blocking everything else [18:54:57] then there's a huge network spike where it talks to everything else [18:55:11] we should really loadbalance that [18:55:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:26] hmm, so if it would not have the notify service attached, you would just restart them slower.. via dsh? [18:56:24] yeah... I dunno. I wish that this just wasn't an issue..... 
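The option raised just above (leave the Apache restart out of the puppet change, then restart hosts slowly instead of all at once) could be sketched roughly as follows. This is only an illustration under stated assumptions, not what was actually run: the `rolling_restart` and `ssh_restart` helpers, the init script path, and the 30-second delay are all hypothetical, and plain `ssh` stands in for dsh.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def ssh_restart(host: str) -> bool:
    """Restart Apache on one host over ssh; True if the command exited 0.

    Assumption: passwordless ssh works and the init script lives at this path.
    """
    cmd = ["ssh", host, "sudo", "/etc/init.d/apache2", "restart"]
    return subprocess.run(cmd).returncode == 0


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], bool] = ssh_restart,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts one at a time, pausing between consecutive hosts.

    Stops at the first failure so a bad change cannot take out the whole
    pool. Returns the hosts that were restarted successfully.
    """
    done: List[str] = []
    for i, host in enumerate(hosts):
        if i > 0:
            sleep(delay)  # let the previous host settle before the next one
        if not restart(host):
            break  # leave the remaining hosts untouched
        done.append(host)
    return done
```

Passing `restart` and `sleep` as parameters keeps the loop testable without touching real hosts; in practice the host list would come from whatever group file dsh already uses.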
[18:58:37] New patchset: CSteipp; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [19:00:29] notpeter: comment out the notification to the service but apply the rest, then a slow running script using dsh to actually restart them one by one ? wouldn't know better either [19:00:55] yea, that sounds like the best bet [19:01:07] New review: CSteipp; "That's it for my changes. I think all of Matthias'es look good." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/28238 [19:07:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.189 seconds [19:17:11] New patchset: Ori.livneh; "Add extension PostEdit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28562 [19:19:35] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28562 [19:19:57] !log reedy synchronized php-1.21wmf2/includes/EditPage.php [19:20:10] Logged the message, Master [19:21:39] !log created ulsfo rt queue [19:21:51] Logged the message, notpeter [19:22:39] test.wikipedia.org seems to be down [19:22:46] paravoid: https://bugs.launchpad.net/swift/+bug/1065869 heh [19:23:01] i think i screwed it up [19:23:04] damnit, sec [19:25:42] no problem, that's what test is for :) [19:26:58] mwalker: ori's fixing [19:27:06] yep yep [19:27:20] so he said in staff [19:27:35] New review: MaxSem; "> I think puppet should have more flexibility around configuration options in solrconfig.xml in addi..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26571 [19:27:43] * mwalker buries head back in varnish config; mutters something about sane languages [19:28:43] !log adjusted a few hostnames for silverpop [19:28:54] Logged the message, Master [19:29:03] thanks Jeff_Green [19:29:07] np [19:29:31] and there goes ns2 [19:35:44] New patchset: Matthias Mullie; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [19:42:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:35] New patchset: Matthias Mullie; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [19:46:22] New patchset: Ottomata; "Installing mysql on analytics1001 for Oozie and Sqoop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28599 [19:47:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28599 [19:49:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28599 [19:54:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.596 seconds [20:06:40] scap starting [20:14:21] AaronSchulz: oh wow, srsly? [20:18:24] !log kaldari Started syncing Wikimedia installation... : [20:18:44] Logged the message, Master [20:26:12] New patchset: Dzahn; "add wikivoyagelb service IPS to lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28604 [20:27:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28604 [20:29:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:18] New review: Dzahn; "see inline comments. this is just to get this started and for review or team work if you feel like ..." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28604 [20:33:52] New review: Dzahn; "one more inline comment, no additions to esams at this point" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28604 [20:36:39] New review: Dzahn; "Mark/Leslie, do you prefer just adding one LB per change or do all the new ones at once" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28604 [20:43:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [20:58:15] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [21:01:10] !log reedy synchronized wmf-config/CommonSettings.php '$wgMaxImageArea to 15' [21:01:25] Logged the message, Master [21:12:00] !log kaldari Finished syncing Wikimedia installation... : [21:12:08] Logged the message, Master [21:16:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:12] PROBLEM - Puppet freshness on tarin is CRITICAL: Puppet has not run in the last 10 hours [21:19:15] PROBLEM - Puppet freshness on snapshot2 is CRITICAL: Puppet has not run in the last 10 hours [21:19:15] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours [21:19:15] PROBLEM - Puppet freshness on sq58 is CRITICAL: Puppet has not run in the last 10 hours [21:19:15] PROBLEM - Puppet freshness on sq57 is CRITICAL: Puppet has not run in the last 10 hours [21:19:15] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [21:19:16] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [21:19:16] PROBLEM - Puppet freshness on sq82 is CRITICAL: Puppet has not run in the last 10 hours [21:19:17] PROBLEM - Puppet freshness on sq86 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Puppet 
has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on search1013 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on snapshot3 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on sq66 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on williams is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on sq36 is CRITICAL: Puppet has not run in the last 10 hours [21:20:22] PROBLEM - Puppet freshness on sq44 is CRITICAL: Puppet has not run in the last 10 hours [21:21:12] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Puppet has not run in the last 10 hours [21:21:12] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [21:21:12] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [21:21:12] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [21:21:12] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [21:26:02] !log aaron synchronized php-1.21wmf2/includes/filebackend/FileBackendStore.php 'deployed ff03172cccad1ce7017c7dc44e508317cc73975f' [21:26:14] Logged the message, Master [21:30:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [21:45:14] grrr, anyone have experience with making nickserv work on znc bouncers [21:45:29] <- Leslie, but obviously auto nickserving not happy [21:52:16] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours 
[21:54:42] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:05:22] !log reedy synchronized php-1.21wmf2/includes/api/ApiQueryRevisions.php [22:05:34] Logged the message, Master [22:14:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.579 seconds [22:23:18] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [22:31:55] * ori-l brbs [22:32:01] mark: do you happen to know what version of GeoIP you're running on bits? [22:32:15] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [22:32:30] !log aaron synchronized php-1.21wmf2/includes/filebackend/FileBackendStore.php 'deployed 15520f489b9adab8bcfd84e6b6a5d48280a85556' [22:32:42] Logged the message, Master [22:37:33] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [22:48:51] New patchset: Dereckson; "Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28625 [22:50:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:10] New patchset: Dereckson; "Unit testing for InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28627 [22:54:32] New review: Dereckson; "Work in progress, don't submit now." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/28627 [22:55:24] AaronSchulz: so, did you find out anything re: yesterday's bug? 
[22:55:48] there is some stale stuff in memcached [22:56:37] stat info to be specific [22:56:51] which does not interact well with how thumb.php streams [22:57:05] aha [22:57:08] okay, not much I can do then [22:57:16] if there is, feel free to ask [22:57:36] the logs show some stat cache update failures, only a few, and not for all the affected files [22:57:44] so there is some degree of mystery [22:57:58] heh [22:58:06] heisenbugs are the best [23:00:08] !log tstarling synchronized php-1.21wmf2/includes/User.php 'debugging hack' [23:00:23] Logged the message, Master [23:00:31] on a completely unrelated note [23:01:06] the TMH bug to make the Cortado configurable was fixed a while ago [23:01:11] but I'm curious, did we actually configure it? :) [23:05:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [23:07:32] !log tstarling synchronized php-1.21wmf2/includes/User.php 'debugging hack' [23:07:44] Logged the message, Master [23:10:04] New patchset: Dereckson; "Removing settings for no more existant wikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28631 [23:11:18] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [23:13:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikimedia.dblist to 1.21wmf2 [23:13:37] Logged the message, Master [23:14:26] !log spage Started syncing Wikimedia installation... : updating 1.21wmf1 & wmf2 with E3Experiments (and 1.21wmf1 with PostEdit) [23:14:38] Logged the message, Master [23:16:28] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:16:36] Logged the message, Master [23:17:18] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [23:21:45] New patchset: Dereckson; "Unit testing for InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28627 [23:22:15] binasher: ping? 
[23:23:34] binasher: what are your memcached test box(es)? packages are on their way to apt but I'd like to test them to be sure this time :) [23:24:28] New patchset: Dereckson; "Removing pt.wikimedia configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28636 [23:31:35] !log spage Finished syncing Wikimedia installation... : updating 1.21wmf1 & wmf2 with E3Experiments (and 1.21wmf1 with PostEdit) [23:31:49] Logged the message, Master [23:34:29] New review: Tychay; "I'll post this note to editor engagement to give them a heads up. :-)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/27830 [23:38:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:20] !log reedy synchronized wmf-config/ [23:42:35] Logged the message, Master [23:43:53] New review: Tychay; "On second thought, I want to have a second to inform wikibooks and wikinews that this feature is goi..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/27830 [23:49:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.775 seconds [23:49:29] I just created an account on enwiki, and got "Sorry! This site is experiencing technical difficulties.Try waiting a few minutes and reloading.(Cannot contact the database server: Unknown database 'dewikivoyage' (10.0.6.44))" [23:51:42] New review: Dereckson; "WIP - Don't merge now." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/28627 [23:51:50] csteipp: ^^ [23:52:10] CentralAuth says no [23:52:11] Uhg [23:52:55] that would be centralauth [23:55:05] It happened again trying to create an account. 
"(Cannot contact the database server: Unknown database 'dewikivoyage' (10.0.6.44))" [23:56:01] So the only place that exists is all.dblist [23:56:14] I have no idea where centralauth is picking it up [23:57:50] New patchset: Ori.livneh; "Enable PostEdit for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28641 [23:58:41] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28641 [23:59:12] csteipp, are you saying that account creation is going wrong at CentralAuth logging you in to WM projects? Sounds plausible. User login works OK, "Logging you in to Wikimedia's other projects" [23:59:49] Yeah, I'm guessing centralauth is trying to create attached accounts