[00:03:59] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [00:04:44] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused [00:07:08] RECOVERY - NTP on mw57 is OK: NTP OK: Offset -0.01100289822 secs [00:11:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:35] RECOVERY - NTP on mw58 is OK: NTP OK: Offset -0.007546186447 secs [00:20:32] !log adding an autofs direct map to LDAP, pointing at the gluster storage for testing home directory move [00:20:41] Logged the message, Master [00:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.179 seconds [00:26:20] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.005 seconds [00:30:10] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [00:35:25] PROBLEM - Apache HTTP on srv190 is CRITICAL: Connection refused [00:44:25] RECOVERY - NTP on mw59 is OK: NTP OK: Offset -0.0138194561 secs [00:50:07] RECOVERY - NTP on srv190 is OK: NTP OK: Offset -0.03206479549 secs [00:56:18] New patchset: Pyoungmeister; "rejiggering eqiad mw numbers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24104 [00:57:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24104 [00:58:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:19] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.076 second response time [01:00:37] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [01:01:13] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [01:02:20] New review: Hashar; "You can set a global variable but labsconsole is currently unable to pass parameters to a parameteri..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/23770 [01:04:53] New patchset: Pyoungmeister; "rejiggering eqiad mw numbers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24104 [01:05:25] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:43] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:05:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24104 [01:07:31] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 108.47 ms [01:08:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24104 [01:12:19] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [01:12:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.356 seconds [01:42:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 288 seconds [01:43:04] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 283 seconds [01:45:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds [01:45:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:46:04] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 25 seconds [01:56:52] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:01:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:32:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.369 seconds [02:59:07] New patchset: 
Krinkle; "Clean up 404-ed symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24110 [02:59:38] New review: Krinkle; "Already gone, uncommitted." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/24110 [02:59:38] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24110 [03:15:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [03:27:45] RECOVERY - Puppet freshness on mw22 is OK: puppet ran at Tue Sep 18 03:27:38 UTC 2012 [04:39:43] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [04:39:43] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [05:44:13] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [05:57:16] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [06:30:16] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [07:47:24] New review: Nemo bis; "Logos have been protected and can now be safely used. https://commons.wikimedia.org/w/index.php?titl..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23985 [08:36:20] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 186 seconds [08:36:56] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds [08:37:50] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:38:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [09:04:06] hi guillom [09:04:15] hello Nemo_bis [09:10:32] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [09:10:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:10:32] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [09:10:32] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [09:10:32] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [09:10:32] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:50:12] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [11:07:08] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [11:27:56] New review: ArielGlenn; "Why not add a case in the else clause of the if thumbpath stanza, which checks for the short form of..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23309 [11:57:35] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [11:57:35] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [12:22:45] !log reboot ms-fe1, upgrading to precise [12:22:55] Logged the message, Master [12:30:13] PROBLEM - SSH on ms-fe1 is CRITICAL: Connection refused [12:30:22] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [12:30:58] PROBLEM - Memcached on ms-fe1 is CRITICAL: Connection refused [12:49:16] PROBLEM - NTP on ms-fe1 is CRITICAL: NTP CRITICAL: No response from NTP server [12:50:01] RECOVERY - SSH on ms-fe1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:02:46] RECOVERY - NTP on ms-fe1 is OK: NTP OK: Offset 0.02798581123 secs [13:14:31] New patchset: ArielGlenn; "add path for rewrite.py for python 2.7 in precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24123 [13:15:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24123 [13:16:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [13:25:44] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24123 [13:35:53] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.019 seconds [13:36:02] RECOVERY - Memcached on ms-fe1 is OK: TCP OK - 0.003 second response time on port 11211 [13:36:17] New patchset: Hashar; "Clean up: Removed old test files out of wikipedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23798 [13:36:23] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23798 [13:41:33] !log {{gerrit|23798}} manually deleted /apache/common-local/docroot/wikipedia.org/{million,test}.txt files from all mediawiki-installation hosts [13:41:43] Logged the message, Master [13:41:56] New review: Hashar; "I have manually deleted them from all mediawiki-installation hosts." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23798 [13:48:24] !log ms-fe1 back in swift frontend pool, upgrade to precise complete. [13:48:33] Logged the message, Master [14:06:44] New patchset: Hashar; "(bug 40122) disable GlobalBlocking on fishbowl, private wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23286 [14:07:35] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23286 [14:08:49] New review: Hashar; "Deployed live at Tue, 18 Sep 2012 14:08:38 +0000" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23286 [14:28:27] New patchset: Jeroen De Dauw; "Update repos used by wikidata" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24129 [14:29:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24129 [14:40:28] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [14:40:28] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [15:33:42] New review: Pyoungmeister; "this is just a redo of https://gerrit.wikimedia.org/r/#/c/24009/" [operations/debs/lucene-search-2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/24021 [15:33:42] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/24021 [15:39:14] New patchset: Hashar; "fix AFTv5 on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24132 [15:41:09] !log srv249 powercycling to check drac configuration [15:41:19] Logged the message, Master [15:41:35] I assume there are no disks yet? (not that I will be here, just checking) [15:41:59] apergos: no, not yet. I am waiting on a response to see if they were even shipped in time [15:42:05] heh [15:42:06] ok [15:44:57] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [15:49:18] PROBLEM - Apache HTTP on srv250 is CRITICAL: Connection refused [15:52:27] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [15:56:39] notpeter: i was checking on this... i shut srv249 down and the server physically shut down, but srv250 apache is reporting critical [15:57:15] cmjohnson1: yeah, there's something funky about the numbering [15:57:18] this sounds familiar....
[15:57:34] mark found it the other day [15:57:37] (sunday) [15:57:51] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [16:21:26] New patchset: Pyoungmeister; "incrementing changelog" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/24136 [16:22:29] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/24136 [16:51:06] PROBLEM - Host mw55 is DOWN: PING CRITICAL - Packet loss = 100% [16:52:30] !log search32 going down to check if it's DIMM error or possible board or cpu1 error [16:52:40] Logged the message, Master [16:56:48] RECOVERY - Host mw55 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:00:24] PROBLEM - SSH on mw55 is CRITICAL: Connection refused [17:01:00] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:24] RECOVERY - SSH on mw55 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:07:52] !log putting srv190 and mw56-59 back into apaches pool [17:08:01] Logged the message, notpeter [17:15:40] New review: Hashar; "wgArticleFeedbackLotteryOdds now use the same value as in production." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/24132 [17:15:40] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24132 [17:19:18] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.019 seconds [17:21:21] New review: Aaron Schulz; "It also makes the code here more complicated and fragile to improve a case that already fails fast a..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/23309 [17:22:22] AaronSchulz: it's just that the thumb will be the short url some small percentage of the time, this was meant to deal with the rest of the cases, it seems a shame to lose all that [17:22:55] (and I gotta run) anyways if you feel strongly that in spite of that the code should just go and there should be no check, add that to your comment and I'll merge it tomorrow morning [17:22:57] apergos: did you see my comment above? [17:23:24] anyway, eventually, I'd prefer them to all be short (for consistency) and we'd rely on disposition [17:23:31] yes, that's what I was just answering to [17:24:02] ok, if that's not in the comment, add a one liner about that and like I say I'll make it live tomorrow (gotta go now, am late) [17:25:36] PROBLEM - NTP on mw55 is CRITICAL: NTP CRITICAL: Offset unknown [17:30:23] New review: Aaron Schulz; "Note that we at least need short thumbs for files of length >= ~160 for good measure." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/23309 [17:33:05] New patchset: Pyoungmeister; "removing excessive ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24140 [17:34:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24140 [17:34:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24140 [17:52:37] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [17:56:34] Mobile throwing 503s? https://bugzilla.wikimedia.org/show_bug.cgi?id=40333 [17:56:44] Had someone just ping me offline about it. [18:00:59] thanks [18:03:13] now I'm here [18:03:40] binasher: good that you're at home, office network hiccups [18:04:46] paravoid: if notpeter is in, he probably has a mifi.. 
verizon lte is more reliable than the office net [18:05:00] I have Leslie's MiFi :-) [18:05:40] booted it up to look at the mobile outage [18:05:44] are you also looking at it? [18:07:22] yeah [18:09:12] unable to get a 503 browsing on m. and i'm not seeing backend_fail or fetch_failed incrementing on varnish [18:09:28] same here [18:11:22] i see lots of allocator failures on the frontends though [18:11:51] SMA.s0.c_fail 54020181198576 29441983.73 Allocator failures [18:11:58] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [18:24:43] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms [18:56:01] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [18:59:11] New patchset: Hashar; "(bug 40186) New logo for es.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23652 [19:00:41] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23652 [19:06:58] PROBLEM - Apache HTTP on mw55 is CRITICAL: Connection refused [19:09:12] New patchset: Jgreen; "add frack hosts to gmetad.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24149 [19:10:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24149 [19:10:18] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24149 [19:11:19] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [19:11:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:11:19] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [19:11:19] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [19:11:19] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:11:19] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [19:16:00] New patchset: Jdlrobson; "transform magic words in message" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24180 [19:27:14] New review: Hashar; "will fix some manually." 
[operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23998 [19:29:26] New patchset: Hashar; "Run stylize.php on InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23998 [19:32:34] New review: Hashar; "Removed some non sense styles" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/23998 [19:32:38] New patchset: Hashar; "Run stylize.php on InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23998 [19:33:09] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23998 [19:37:06] New patchset: Hashar; "enable EPUB collection format on ALL wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24207 [19:37:40] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24207 [19:42:28] New patchset: Hashar; "(bug 39905) Create interface editor group on pt.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22370 [19:42:35] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22370 [19:45:07] New patchset: Hashar; "(bug 39942) Disables patrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22999 [19:45:23] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22999 [19:46:57] New patchset: Pyoungmeister; "correcting regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24209 [19:47:53] New patchset: Hashar; "Cleaning Berlin hackathon tutorial configuration." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23062 [19:47:53] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23062 [19:47:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24209 [19:48:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24209 [19:48:48] New patchset: Hashar; "Clean up: Removed sep11wiki code" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22534 [19:49:08] New review: Hashar; "Thanks for the clean up!!" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22534 [19:49:08] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22534 [19:51:13] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [19:53:02] New review: Hashar; "One of the problem with that change is the community will have to get someone having sysadmin right ..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23985 [19:56:06] New patchset: Hashar; "(bug 40270) jawiki: set wgAutoConfirmCount = 10" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23927 [19:57:24] New review: Hashar; "tweaked commit message a bit." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/23927 [19:58:52] PROBLEM - Host mw55 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:52] PROBLEM - Host mw56 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:11] New patchset: Hashar; "Removes transcoding stub" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17365 [20:03:38] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/17365 [20:05:44] RECOVERY - Host mw56 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [20:05:44] RECOVERY - Host mw55 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [20:05:53] New review: Nemo bis; "No, the v2 logo is the logo they made after some 3D studies in 2010, improving the puzzle and so on." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23985 [20:07:05] New patchset: Hashar; "(bug 40173) Set wgArticleCountMethod for gu.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23490 [20:07:12] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23490 [20:09:16] New review: Hashar; "$ mwscript updateArticleCount.php --wiki guwikisource --update" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23490 [20:09:38] PROBLEM - SSH on mw56 is CRITICAL: Connection refused [20:09:47] PROBLEM - SSH on mw55 is CRITICAL: Connection refused [20:09:56] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [20:12:46] New patchset: Hashar; "(bug 39652) Fix "autoreviewer" restriction level on ptwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23997 [20:12:58] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23997 [20:14:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24129 [20:16:51] New patchset: Hashar; "disable anon edits on wikimania2012 wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23795 [20:17:12] New review: Hashar; "* tweaked commit message" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23795 [20:17:12] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23795 [20:17:52] New review: Hashar; "Deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23795 [20:25:16] New review: Dzahn; "yep, sync-apache is /usr/local/bin/sync-apache. Note though that "sync" is /bin/sync which is someth..." 
[operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23033 [20:25:16] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23033 [20:25:52] mutante: our hero :-)) [20:26:19] hashar: heh, yw. Just saying that "sync" by itself is something completely unrelated [20:26:43] we need to clean up the scripts from /home/wikipedia/bin [20:26:50] there is a ton of old / outdated ones there [20:26:53] PROBLEM - NTP on mw56 is CRITICAL: NTP CRITICAL: No response from NTP server [20:26:54] and some that are not in puppet yet [20:27:01] yes, i agree we wanted stuff to move away from /h/w/bin [20:27:20] RECOVERY - SSH on mw56 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:27:44] if you echo $PATH on fenari, /h/w/bin is in it though [20:28:00] and /h/w/conf etc. is not [20:28:05] RECOVERY - SSH on mw55 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:28:26] so if you just do a "sync" that will not be sync-apache, it will be sync as in "man sync" [20:30:00] blame tim for it :-] [20:30:07] I guess the point was to use ./sync in the dir [20:30:36] yea, ok [20:30:51] speaking of 'man', I have started a change to document our scripts : https://gerrit.wikimedia.org/r/#/c/16606/ [20:31:12] it provides the foundations to build man pages out of asciidoc files [20:31:18] nice! [20:31:19] (kind of like markdown markup) [20:31:52] that might help us enroll newcomers later on [20:32:06] of course, most of the scripts are not documented yet, i am doing that in my spare time [20:32:24] feel free to pass the word around in the SF office [20:32:30] and then you want to use "a2x" to convert that to manpages?
[20:32:47] ok, cool, will do [20:32:58] yeah a2x is based on docbook [20:33:15] apparently converts asciidoc to some docbook xml file then generate the manpages using XSLT [20:33:23] (i love xml / xslt ) [20:33:37] the makefile is pretty straightforward, [20:33:52] might need some sanity check and tips to install the required software [20:34:08] I was looking at having puppet generate the man page automatically but haven't found out a way to generate them [20:34:14] might need some script on puppet master [20:37:32] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [20:39:46] New patchset: Demon; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [20:40:40] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/24213 [20:43:42] hashar: if you ever want some specific man page reviewed, let me know, I'd like to do that [20:44:00] New patchset: Demon; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [20:44:18] chrismcmahon: it is more about writing them first :-)) [20:44:28] chrismcmahon: will definitely let you know :) [20:44:30] hashar: indeed :) [20:44:56] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/24213 [20:44:57] that's all stuff I wish had better docs than exist today. [20:45:49] New patchset: Pyoungmeister; "explicitly setting package versions for apaches to stop getting ubuntu versions on install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24214 [20:46:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24214 [20:46:58] New patchset: Demon; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [20:47:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24213 [20:48:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24214 [20:51:07] New patchset: Demon; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [20:52:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24213 [20:52:09] off to bed now [20:52:13] see you tomorrow [20:52:15] nn hashy [20:52:59] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [20:53:52] New patchset: Demon; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [20:54:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24213 [20:57:47] New patchset: Ryan Lane; "Don't use None in salt configs as a default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24216 [20:58:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24216 [21:00:28] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24216 [21:04:50] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [21:08:17] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [21:10:10] New review: Demon; "PS5 tested on labs, working. Should have zero net impact for production." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/24213 [21:10:41] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [21:19:05] RECOVERY - NTP on mw56 is OK: NTP OK: Offset -0.006861448288 secs [21:26:44] RECOVERY - NTP on mw55 is OK: NTP OK: Offset -0.07633185387 secs [21:50:26] PROBLEM - Apache HTTP on mw55 is CRITICAL: Connection refused [21:55:59] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [21:58:41] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [21:58:41] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [22:16:51] New patchset: Ryan Lane; "Give demo-deployment1 deploy runner rights in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24225 [22:17:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24225 [22:17:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24225 [22:18:28] wait. shit. that isn't going to work [22:18:47] ugh. the problem with having the master for labs in production [22:18:52] New patchset: Jdlrobson; "transform magic words in message" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24180 [22:22:53] New patchset: Ryan Lane; "Fix setting peer runner info for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24227 [22:23:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24227 [22:23:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24227 [22:29:39] New patchset: Ryan Lane; "Fix external_nodes setting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24229 [22:30:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24229 [22:32:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24229 [22:35:41] New patchset: Ryan Lane; "Explicitly name the deployment host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24231 [22:36:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24231 [22:36:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24231 [22:50:17] New patchset: Catrope; "Send gerrit-wm notifications for Parsoid to #parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24233 [22:51:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24233 [22:52:52] New review: MarkTraceur; "I like it! Thanks, Roan." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/24233 [22:55:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24233 [22:57:38] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Tue Sep 18 22:57:21 UTC 2012 [23:02:17] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 224 seconds [23:02:17] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 225 seconds [23:02:17] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 226 seconds [23:02:53] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 260 seconds [23:02:53] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [23:02:55] Seem to be getting (Cannot contact the database server: Unknown error (10.0.6.73)) on en wiki =/ [23:02:55] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 261 seconds [23:02:56] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 261 seconds [23:03:02] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 273 seconds [23:03:13] New patchset: Pyoungmeister; "pin lucene version in eqiad, unpin in pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24235 [23:03:22] cannot do this [23:03:24] * Damianz pokes Ryan_Lane if he's still around ^ [23:03:38] my misfortune, I just sat down but it's 2am here [23:03:44] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 279 seconds [23:03:44] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 279 seconds [23:03:44] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 280 seconds [23:04:08] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24235 [23:04:29] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 358 seconds [23:04:47] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 374 seconds [23:04:47] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 374 seconds [23:04:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24235 [23:05:14] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.312 second response time [23:05:16] * Ryan_Lane groans [23:05:44] god damn it. what's the log server name again? [23:05:52] I *really* hate it [23:05:52] fluorine? [23:05:59] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [23:05:59] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [23:06:08] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [23:06:15] 2 $> ssh root@flourine [23:06:15] ssh: Could not resolve hostname flourine: Name or service not known [23:06:17] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [23:06:17] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [23:06:17] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [23:06:17] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [23:06:17] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 1 seconds [23:06:59] New patchset: Jdlrobson; "refine contact us emails to include referring page and whether from app (bug 36388)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24238 [23:07:01] why can't we just name things with names that actually go with the function? 
[23:07:02] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds
[23:07:02] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[23:07:02] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds
[23:07:02] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds
[23:07:02] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds
[23:07:10] there is a mysql eqiad spike visible in ganglia but not much else (already past)
[23:07:27] for outgoing only
[23:08:06] uh
[23:08:09] fluorine?
[23:08:15] uo
[23:08:24] wonderful
[23:08:45] yeah that's it
[23:08:49] we really have *terrible* naming schemes
[23:09:13] At least the apache/mysql/lvs boxes are sorta named in a vaguely logical way :P
[23:09:27] because they are clustered
[23:09:28] New patchset: Jdlrobson; "refine contact us emails to include referring page and whether from app (bug 36388)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24238
[23:09:36] clustered systems get named by function
[23:09:42] at least it should have a cname
[23:10:42] * binasher just got back from a late lunch
[23:11:15] sorry I'm not going to be much help guys
[23:11:16] jobrunners took a serious dip at the same time
[23:11:33] yeah I saw that
[23:11:38] dunno if it's related
[23:11:55] cause the incoming traffic dropped a ton too
[23:12:15] apergos: did you see that the math/ copying is done :)
[23:12:22] yes, yay
[23:12:39] captcha has a patch in gerrit too
[23:12:46] sweet
[23:12:50] knocking em out one by one
[23:13:01] * Damianz finds a donut for apergos
[23:13:02] someone needs to assign someone to the tmh jar, maybe robla
[23:13:14] apergos: I guess the hardest one is ext-dist
[23:13:20] that'll suck
[23:13:24] * AaronSchulz leans towards the "separate wiki" approach
[23:13:49] throw MW on a box and use that extension ;)
[23:13:55] :-D
[23:13:56] I wonder if there was a network issue during that time
[23:13:59] mediawiki.org :-P
[23:14:08] except not
[23:14:18] heh
[23:16:03] graphs don't generally line up for that
[23:16:27] I'd expect a much clearer pattern
[23:16:46] job runners dip, mysql replication spikes then goes back to normal
[23:16:56] application servers load dips
[23:18:00] eqiad squids dip
[23:18:02] look at mysql pmtpa graph for network
[23:18:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[23:19:34] maybe atop on one of those eqiad boxes would show something
[23:20:13] New patchset: Jdlrobson; "move from deprecated wfMsg to wfMessage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24243
[23:25:54] the enwiki master went oom
[23:26:26] oh, had you found it and fixed it?
[23:26:43] or was it still OOM'd till just now?
[23:26:53] it fixed itself
[23:27:03] how does that occur?
[23:27:18] which part?
[23:27:27] the master fixing itself
[23:27:31] it rebooted?
[23:28:17] or is our oom killer actually configured in a way that doesn't make the system basically dead?
[23:28:18] mysqld was killed by the oom killer
[23:28:41] it looks like just before, the majority of running transactions were from the wikiadmin user
[23:28:50] so, stuff from the job queue
[23:28:55] ugh
[23:29:13] " mysqld was killed by the oom killer" ... that should be a bz quip
[23:29:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 217 seconds
[23:30:10] RECORD LOCKS space id 0 page no 27539411 n bits 128 index `PRIMARY` of table `enwiki`.`job` trx id AE13B126 lock_mode X locks rec but not gap
[23:30:11] Record lock, heap no 34 PHYSICAL RECORD: n_fields 8; compact format; info bits 0
[23:30:12] 0: len 4; hex 00f73df6; asc = ;;
[23:30:14] 1: len 6; hex 0000ab56f9c2; asc V ;;
[23:30:15] 2: len 7; hex 80029a401a02ca; asc @ ;;
[23:30:17] 3: len 13; hex 726566726573684c696e6b7332; asc refreshLinks2;;
[23:30:18] 4: len 4; hex 8000000a; asc ;;
[23:30:20] 5: len 6; hex 4e6176626f78; asc Navbox;;
[23:30:21] 6: len 30; hex 613a343a7b733a353a227461626c65223b733a31333a2274656d706c6174; asc a:4:{s:5:"table";s:13:"templat; (total 185 bytes);
[23:30:22] 7: len 14; hex 3230313230393138313631393331; asc 20120918161931;;
[23:30:43] refreshLinks2 job for the Navbox template
[23:31:05] yay...
[23:32:02] if two hundred of them queued for same lock, then deadlock detection could've taken it all down
[23:32:58] mysql> select distinct job_title as title, count(job_title) as count, job_namespace from job group by title order by count desc limit 1;
[23:32:59] +--------+--------+---------------+
[23:33:01] | title | count | job_namespace |
[23:33:02] +--------+--------+---------------+
[23:33:03] | Navbox | 126086 | 10 |
[23:33:04] +--------+--------+---------------+
[23:33:05] 1 row in set (0.20 sec)
[23:33:41] that's a lot
[23:34:19] Ryan_Lane: re: recovery post oom kill, that was just mysqld_safe doing its job
[23:34:36] domas: you guys disable deadlock detection, right?
[23:34:41] ah
[23:34:54] yeh
[23:35:05] we don't have deadlocks too
[23:35:13] ;-)
[23:35:43] what happens if a hundred txns try to lock the same row with deadlock detection disabled?
[23:35:59] depends whether hundreds or thousands ;-)
[23:36:08] if hundreds, some of them will have lock wait timeouts
[23:36:14] we run with very low lock wait timeouts too
[23:36:15] 2 seconds or so
[23:36:29] domas: if fb would only make the Pages that mirror each wikipedia page group editable, we could just shut down wmf
[23:36:36] yup
[23:36:45] and fb wouldn't notice load change
[23:36:49] hehehe
[23:36:57] heh
[23:37:11] there's some cliff at much higher number of rows dequeueing from lock graph
[23:37:22] but that one is way more difficult to reach than the deadlock one
[23:37:52] bah, speaking of lock timeouts, we're getting a lot on user.user_touched updates
[23:37:55] http://dom.as/2009/12/21/on-deadlock-detection/
[23:38:05] didn't I file a bugzilla thing on that?
[23:38:40] I was investigating those, it was general stupidity there iirc
[23:40:30] things depending on things depending on things
[23:40:33] are we updating it on every auth'd pageview
[23:40:35] ugh
[23:40:50] not normally
[23:40:57] but there is some case where that can happen
[23:41:22] like a bot storing the global cookie and not the local ones or something
[23:42:05] it is globalauth code
[23:42:10] centralauth that is
[23:42:44] AaronSchulz: re: ton 'o jobs.. unavoidable if the navbox template is edited?
[23:43:06] probably
[23:45:03] binasher: just imagine how much better this will get with wikidata
[23:45:58] Ryan_Lane: :( :( :( :( :( :( :( :( :(
[23:48:34] because moving infoboxes to a shared data service should require fully reparsing everything across every using any of that data across every wiki instead of.. i dunno, always loading it dynamically
[23:48:44] RAGE
[23:48:56] * Damianz looks at binasher funny
[23:52:35] binasher: but won't lua solve EvERythINgz? :D
[23:56:55] binasher: no standardized infoboxes ;)
[23:57:08] yes, it will solve everything including my jet lag.
[23:57:13] so, it's not really possible for wikidata to generate them. only the data items will change
[23:57:13] * apergos goes away... see folks tomorrow
[23:57:19] so every data item change will result in a reparse
[23:57:21] \o/
[23:58:49] binasher: DBAHULK is your twitter account?:)
[23:59:38] i wish! i'm not that lulzy tho
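Editorial sketch, not part of the channel log: the job-queue query pasted at 23:32:58 (counting queued jobs per title, which showed 126086 rows for Navbox in namespace 10) can be reproduced in miniature roughly as below. This is only an illustration under assumptions: it uses an in-memory SQLite database rather than the production MySQL server, the table layout is a simplified stand-in (only `job_title` and `job_namespace` appear in the log; `job_id`, `job_cmd`, the helper names, and the sample rows are hypothetical), and it groups by both namespace and title so that the reported namespace is well defined.

```python
import sqlite3


def make_sample_db():
    """Build a tiny in-memory stand-in for the job table (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job ("
        "job_id INTEGER PRIMARY KEY, "
        "job_cmd TEXT, "
        "job_namespace INTEGER, "
        "job_title TEXT)"
    )
    rows = (
        [("refreshLinks2", 10, "Navbox")] * 5
        + [("refreshLinks2", 10, "Infobox")] * 2
        + [("htmlCacheUpdate", 0, "Main_Page")]
    )
    conn.executemany(
        "INSERT INTO job (job_cmd, job_namespace, job_title) VALUES (?, ?, ?)",
        rows,
    )
    return conn


def top_job_titles(conn, limit=1):
    """Return (job_namespace, job_title, count) tuples, most-queued title first."""
    return conn.execute(
        "SELECT job_namespace, job_title, COUNT(*) AS n "
        "FROM job "
        "GROUP BY job_namespace, job_title "
        "ORDER BY n DESC, job_title ASC "
        "LIMIT ?",
        (limit,),
    ).fetchall()


if __name__ == "__main__":
    conn = make_sample_db()
    for namespace, title, count in top_job_titles(conn, limit=3):
        print(namespace, title, count)
```

Grouping by `(job_namespace, job_title)` instead of title alone avoids relying on MySQL's permissive handling of non-aggregated columns, at the cost of counting same-named pages in different namespaces separately.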