[00:02:26] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [00:04:50] New review: Hashar; "Ooops :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39158 [00:05:07] hehe [00:05:09] both merged fwiw [00:05:20] works for me [00:05:23] thanks ! [00:06:41] I will still do my changes in puppet anyway. Ops reviews are priceless. [00:10:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:50] !log temp stopping lsave on es1009 and es1010 for upcoming networking downtime [00:11:58] Logged the message, notpeter [00:14:37] New patchset: Ori.livneh; "rsync eventlogging logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39161 [00:15:20] PROBLEM - MySQL Replication Heartbeat on es1009 is CRITICAL: CRIT replication delay 231 seconds [00:16:41] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on es1009 is CRITICAL: CRIT replication delay 231 seconds peter for network downtime [00:16:41] PROBLEM - MySQL Replication Heartbeat on es1010 is CRITICAL: CRIT replication delay 268 seconds [00:16:53] thank you hashar :) [00:17:11] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on es1010 is CRITICAL: CRIT replication delay 268 seconds peter for network downtime [00:17:27] LeslieCarr: not sure why but you are welcome :D [00:17:28] mutante: I'm ack'ing things just for you :) [00:17:56] notpeter: thanks:) [00:18:01] for the ops reviews are priceless [00:18:14] notpeter: some day i want that to be a bot feature [00:18:16] kind of stating the obvious :-D [00:20:08] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: HTTP CRITICAL - No data received from host [00:25:32] hashar: i wanted to give you full access, but already done :) [00:25:38] hashar doesn't need fine-grained sudo on gallium anymore, as he's root. [00:25:49] paravoid added you [00:26:11] mutante: sorry :-]  we have been talking about it like half an hour ago [00:26:25] !log starting upgrade of asw-c-eqiad.mgmt - connectivity to row c machines may be affected [00:26:29] i see you wrote the change even ..nice! [00:26:34] Logged the message, Mistress of the network gear. [00:26:45] mutante: ops taught me to use puppet so I do! [00:27:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [00:27:22] hashar: awesome:) [00:28:02] hashar: resolved.. cat /etc/sudoers.d/hashar [00:29:53] PROBLEM - Host analytics1018 is DOWN: CRITICAL - Network Unreachable (10.64.36.118) [00:29:53] PROBLEM - Host analytics1020 is DOWN: CRITICAL - Network Unreachable (10.64.36.120) [00:29:54] PROBLEM - Host analytics1025 is DOWN: CRITICAL - Network Unreachable (10.64.36.125) [00:29:54] PROBLEM - Host analytics1024 is DOWN: CRITICAL - Network Unreachable (10.64.36.124) [00:29:54] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Network Unreachable (10.64.36.127) [00:30:02] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [00:30:02] PROBLEM - Host es1009 is DOWN: CRITICAL - Network Unreachable (10.64.32.19) [00:30:03] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [00:30:29] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Network Unreachable (10.64.36.122) [00:30:29] PROBLEM - Host es1010 is DOWN: CRITICAL - Network Unreachable (10.64.32.20) [00:30:38] PROBLEM - Host analytics1017 is DOWN: CRITICAL - Network Unreachable (10.64.36.117) [00:30:38] nagios-wm: are they in the same rack? 
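
The ACKNOWLEDGEMENT entries above, and the wish voiced above to make acking a bot feature, come down to writing one line into the Nagios/Icinga external command pipe. A minimal sketch, reusing the host/service/author/comment from the alerts above and assuming a stock command-file path (the real path depends on how the monitoring host is set up):

    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    # (needs write access to the monitoring daemon's command pipe)
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;es1009;MySQL Replication Heartbeat;1;1;1;peter;for network downtime\n' \
        "$(date +%s)" > /var/lib/icinga/rw/icinga.cmd

A bot feature would just be a thin IRC front-end around the same write.
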
[00:30:39] PROBLEM - Host analytics1016 is DOWN: CRITICAL - Network Unreachable (10.64.36.116) [00:30:39] PROBLEM - Host analytics1021 is DOWN: CRITICAL - Network Unreachable (10.64.36.121) [00:30:56] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Network Unreachable (10.64.36.111) [00:31:05] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Network Unreachable (10.64.36.119) [00:31:05] PROBLEM - Host analytics1013 is DOWN: CRITICAL - Network Unreachable (10.64.36.113) [00:31:14] PROBLEM - Host analytics1015 is DOWN: CRITICAL - Network Unreachable (10.64.36.115) [00:31:14] PROBLEM - Host analytics1023 is DOWN: CRITICAL - Network Unreachable (10.64.36.123) [00:31:23] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Network Unreachable (10.64.36.126) [00:31:24] PROBLEM - Host analytics1014 is DOWN: CRITICAL - Network Unreachable (10.64.36.114) [00:31:24] PROBLEM - Host analytics1012 is DOWN: CRITICAL - Network Unreachable (10.64.36.112) [00:31:45] ^ dschoon / drdee / ottomata? [00:31:56] that looks lovely. [00:32:05] that's the asw-c-eqiad reboot [00:32:07] oh, probably leslie's upgrade. [00:32:08] that's scheduled [00:32:10] right. sorry. [00:32:12] yep [00:32:14] that's me [00:32:17] mwhahaha [00:32:18] nice. [00:32:24] i didn't like those machines anyway. [00:32:28] oh noes, i forgot about es1007 and es1008 [00:32:31] they're always making me do stuff. [00:32:33] sorry notpeter :-/ [00:32:40] !log fixing fenari permissions for gwicke.. (pre-puppet age UID) [00:32:48] Logged the message, Master [00:32:53] you know, i always thought this job would be easier if we just disabled all the external access [00:32:57] users always cause scaling issues [00:33:02] data you need to analyze [00:33:20] use domas optimizer: // [00:35:24] night time, have a good afternoon [00:35:33] gwicke: 2006, February 13 :) [00:36:26] * Susan smiles at Domas optimizer. [00:36:35] mutante: that was the last change to my permissions? [00:36:40] yeah:) [00:36:47] k ;) [00:37:04] puppet could not update the file, it did not have permissions:) [00:37:11] fixed and running again [00:37:32] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [00:37:32] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [00:37:41] gwicke: welcome back! should work now [00:37:51] gabriel@tosh is in place [00:38:14] Is there a host named "tosh"? [00:38:24] mutante: awesome, thanks! [00:38:32] Susan: yes, my laptop [00:38:34] Susan: somewhere under gwickes control, yea [00:38:40] bonus for guessing the brand ;) [00:38:44] Heh. [00:38:52] gwicke: ooh..i thought Reggae music [00:38:57] Ah, "tosh" is an English Wiktionary in-joke. [00:39:11] hehe [00:39:23] https://en.wiktionary.org/wiki/tosh and related deletion log summaries. [00:39:31] Susan: wiktionary user! yay! [00:39:45] I'm just familiar with its culture. ;-) [00:40:12] my original home wiki i guess [00:40:30] "Stupidity" still being in block drop-down menu is... classic English Wiktionary. [00:40:41] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [00:40:50] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [00:41:43] well this is good…. the upgrade/reboot to fix everything made all lacp stay down [00:42:17] it was only supposed to be a 3 hour tour! [00:57:09] !log asw-c-eqiad unreachable due to lacp issue [00:57:17] Logged the message, Mistress of the network gear. 
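
The ae bundles that stayed down on asw-c-eqiad after the upgrade are ordinary Junos aggregated-Ethernet (LACP) interfaces. A rough sketch of the checks and configuration involved; the interface names here are placeholders, not the real switch config:

    show lacp interfaces ae1          # are the member links Collecting/Distributing?
    show interfaces ae1 extensive     # bundle and member-link state

    # configuration side (EX-series style, names assumed)
    set interfaces xe-2/0/0 ether-options 802.3ad ae1
    set interfaces ae1 aggregated-ether-options lacp active
    # the "static LAG" fallback discussed later in the log is the same bundle
    # with the lacp stanza removed:
    delete interfaces ae1 aggregated-ether-options lacp
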
[01:00:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:32] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:38] New review: Dzahn; "stop..breakage ..UID 618 is duplicate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35924 [01:14:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 231 seconds [01:15:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [01:15:21] New patchset: Dzahn; "fix UID for sbernarding to 623, duplicate UID usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39172 [01:15:22] New patchset: Dzahn; "fix UID for maryana to 624, duplicate UID usage (609/csteipp)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39173 [01:15:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39172 [01:16:04] New review: Dzahn; "https://gerrit.wikimedia.org/r/39172" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35924 [01:18:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:18:48] New review: Dzahn; "this used a duplicate UID.. same with csteipp ..was 609. fixed in https://gerrit.wikimedia.org/r/#/c..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24008 [01:19:07] LeslieCarr: please install fix_lacp.exe [01:21:28] New review: Dzahn; "https://rt.wikimedia.org/Ticket/Display.html?id=3517" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24008 [01:28:09] !log fixing duplicate UID issue on stat1 for maryana [01:28:19] Logged the message, Master [01:28:34] New review: Dzahn; "fix RT-3517" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/39173 [01:28:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39173 [01:29:49] New review: Tychay; "By the power of Greyskull, I lack the powerz to +2." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/32362 [01:36:29] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:36:46] err: /Stage[main]/Misc::Statistics::Mediawiki/Git::Clone[statistics_mediawiki]/Exec[git_pull_statistics_mediawiki]/returns: change from notrun to 0 failed: git pull --quiet returned 1 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:679 [01:48:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.806 seconds [02:05:26] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [02:11:08] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:08] RECOVERY - Host es1009 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:09] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [02:11:09] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:17] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [02:11:17] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [02:11:18] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [02:11:26] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [02:11:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:35] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [02:11:35] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [02:11:36] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 26.97 ms [02:11:36] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [02:11:36] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [02:11:36] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:44] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:11:44] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [02:11:44] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 27.28 ms [02:11:45] RECOVERY - Host analytics1024 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:12:20] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:12:29] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [02:24:02] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [02:25:22] !log LocalisationUpdate completed (1.21wmf6) at Tue Dec 18 02:25:22 UTC 2012 [02:25:32] Logged the message, Master [02:29:53] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:23] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time [02:37:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:45:58] !log LocalisationUpdate completed (1.21wmf5) at Tue Dec 18 02:45:58 UTC 2012 [02:46:08] Logged the message, Master [03:24:02] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP 
CRITICAL - No data received from host [04:13:59] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:21:56] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [04:51:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [04:57:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:58:59] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:06:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.364 seconds [05:41:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:51:54] RECOVERY - MySQL Replication Heartbeat on es1009 is OK: OK replication delay 0 seconds [05:52:21] RECOVERY - MySQL Replication Heartbeat on es1010 is OK: OK replication delay 0 seconds [05:56:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.918 seconds [06:24:07] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [06:29:54] !log rsynced all labs homedirs to gluster volumes [06:30:05] Logged the message, Master [06:30:30] !log switched all labs instances to mount /home via gluster on next reboot [06:30:38] Logged the message, Master [06:31:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:51] @instance-info nova-precise1 [06:38:57] ugh [06:44:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.007 seconds [07:13:45] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [07:18:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [07:34:36] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.006 second response time on port 11000 [08:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:19:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.492 seconds [08:26:13] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:40] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [08:29:40] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [08:29:40] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [08:31:55] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [08:35:00] New review: J; "> Why is this contained to the production realm? We should have it in betalabs too." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/38307 [08:37:03] hey j^ :) [08:55:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:04:46] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds [09:05:40] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 211 seconds [09:10:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [09:13:58] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:20] Change merged: Nikerabbit; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38910 [09:23:20] Danny_B|backup: ^ [09:25:13] !log nikerabbit synchronized wmf-config/CommonSettings.php 'Bug 43075' [09:25:22] Logged the message, Master [09:25:31] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:25:31] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:04] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:04] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:29] hello [09:40:23] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:42:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:47:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:55:54] apergos: around? [09:56:00] yes [09:56:40] paravoid: [09:57:28] I'm surprised that you said that we're looking at the end of January [09:57:31] is it that bad? 
[09:57:34] yes [09:57:48] it takes 3.5 days at best to complete an objct replication run [09:58:10] I'm taking them out and putting them in at 33/66/100 [09:58:27] right now we're behind because ms-be1 is playing catchup after its outage too [09:58:31] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:58:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [09:59:06] I'm not basing this off the etas, but off of actual run times [09:59:25] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [10:03:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:18:10] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [10:23:52] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:32:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:47:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [10:49:29] New patchset: Hashar; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [10:49:44] !g 39060 [10:49:44] https://gerrit.wikimedia.org/r/#q,39060,n,z [10:50:06] New review: Hashar; "rebased on top of https://gerrit.wikimedia.org/r/#/c/39060/" [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/32167 [10:50:09] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:51] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [10:59:45] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [11:16:06] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:19:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:35:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [11:42:54] New patchset: Hashar; "find filename based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [11:43:16] New review: Hashar; "I have extracted the MWRealm files in https://gerrit.wikimedia.org/r/39191 . They need to be improve..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/32167 [12:06:38] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [12:08:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:20] New patchset: Hashar; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [12:17:26] New review: Hashar; "PS2:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39191 [12:24:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [12:25:14] New patchset: Hashar; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [12:25:42] New review: Hashar; "PS3 makes it so the realm takes precedence over datacenter file when both choices are availables." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39191 [12:26:37] out for lunch [12:26:40] brb :) [12:33:38] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [12:35:08] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:50] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [12:44:08] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [12:57:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.550 seconds [13:33:41] New patchset: Demon; "Whitelist some more mimetypes for Gerrit to trust" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39196 [13:45:29] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:45:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:50] !log restarted puppet on gallium (some apt-get process was a zombie) [13:50:58] Logged the message, Master [13:51:11] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms [13:56:17] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:00:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [14:06:00] !g Ie2dd947c5345cdbcb0d7585f16928dbb7a980d2b [14:06:00] https://gerrit.wikimedia.org/r/#q,Ie2dd947c5345cdbcb0d7585f16928dbb7a980d2b,n,z [14:08:28] New patchset: Hashar; "sort hashes when expanding templates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39201 [14:13:48] <^demon> !log restarting gerrit on manganese to pick up VRIF+2 [14:13:56] Logged the message, Master [14:14:29] ^demon: worked :) [14:14:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:14:54] !log restarting Zuul with https://gerrit.wikimedia.org/r/39082 so it starts voting Verified+2 [14:15:02] Logged the message, Master [14:15:39] done [14:16:02] <^demon> Ok, gerrit restarted. All-Projects and mediawiki/* updated with new acls. [14:16:10] <^demon> Doing the acls on the 8 misc. extensions now [14:16:28] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39201 [14:17:05] ^demon: apparently it does +2 now https://gerrit.wikimedia.org/r/#/c/39201/ [14:17:08] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [14:18:08] <^demon> Extensions updated. [14:22:41] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [14:23:55] hmm [14:24:08] I just thought of something, adding a new event in Gerrit to simply retrigger tests [14:24:28] event would be sent whenever someone push the "Recheck" button on the patchset :-] [14:26:34] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39196 [14:30:20] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:30:52] I found an easy change to CR+2 ;) [14:32:57] <^demon> Can we have it skip doing CR=0 so it doesn't say "No score" on every change? 
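
Back to the swift capacity discussion earlier in the log: apergos is re-adding the ms-be backends by stepping their ring weights (33, then 66, then 100) and letting a full object-replication pass, about 3.5 days, converge between steps. With swift-ring-builder that staging looks roughly like the following; the builder file and device spec are placeholders, not the production ring:

    swift-ring-builder object.builder set_weight z1-10.0.6.200:6000/sdb1 33
    swift-ring-builder object.builder rebalance
    # push the new ring files out, wait for replication to settle,
    # then repeat with 66 and finally 100

Repeating that staging across the remaining backends is what stretches the work out toward the "end of January" mentioned above.
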
[14:33:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:29] ^demon: CR+2 merged the change with V+2 : https://gerrit.wikimedia.org/r/#/c/39027/ [14:35:47] <^demon> Yay, it worked! [14:36:02] ^demon: CR=0 is used to reset the CR-2 flag jenkins set whenever unit test fails [14:36:38] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [14:37:14] ^demon: do you have an example ? [14:38:48] <^demon> In the change you just gave, 39027, it assigned a CR=0 when it gave VRIF+2. [14:39:16] <^demon> Probably won't be a big deal. [14:41:27] ^demon: I guess the deployment works fine [14:42:14] <^demon> Glad it went so smoothly. I think people will like having the CRVW category for humans only again. [14:42:23] yup [14:42:35] I did the CR+1 to make it obvious to people that Jenkins linted the change [14:42:44] ended up not being a good idea after all [14:47:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.414 seconds [14:51:03] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.012 second response time on port 11000 [14:52:41] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [14:57:02] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:30] <^demon> hashar: I updated wikitech-l that we're done and everything seems good. [14:58:49] \O/ [14:59:44] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:02:44] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [15:03:27] Nikerabbit, exception log is full of "Exception from line 356 of /usr/local/apache/common-local/php-1.21wmf6/includes/cache/MessageCache.php: MessageCache failed to load messages" [15:03:35] is it something to worry about? [15:06:20] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [15:08:30] New patchset: Cmjohnson; "Adding solr1001-1003 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39206 [15:08:39] whee [15:09:57] robh: wanna check my spelling and spacing plz ^^ [15:12:18] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39206 [15:12:52] cmjohnson1: done, not merged on sockpuppet, thats all you [15:13:26] haaaai paravoid, you around? [15:16:39] New patchset: Cmjohnson; "Adding to solr servers to netboot using lvm.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39208 [15:17:51] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39208 [15:19:53] !log Load-testing spatial search [15:20:02] Logged the message, Master [15:21:23] Before switching off yttrium, I gonna have some data for perf comparison:) [15:23:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:10] MaxSem: yes and no [15:30:31] MaxSem: it has failed before causing nasty side effects like sidebar reverting to default value or no gadgets [15:30:58] MaxSem: if it is happening very often it should be looked into [15:31:23] Nikerabbit, exception log currently consists mostly of this exception [15:31:52] as in, many times per second or few every minute? 
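
The exceptionmonitor / fatalmonitor scripts that come up just below are essentially a periodic tail-and-tally over the aggregated MediaWiki logs on fluorine. A minimal sketch of the idea; the log path, interval and grep pattern are assumptions, not the actual scripts:

    watch -n 5 "tail -n 1000 /a/mw-log/exception.log \
        | grep -o 'Exception from line [0-9]* of [^:]*' \
        | sort | uniq -c | sort -rn | head -20"

The "watching a log every 2 seconds" objection below is about that polling interval; the blocking alternative mentioned there would follow the log with tail -F instead of re-reading it on a timer.
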
[15:33:38] as in, 56 of this and only 2 of any other exception in the last 1000 lines [15:35:29] try /home/maxsem/exceptionmonitor on fluorine [15:36:26] what's that [15:36:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.197 seconds [15:38:05] fluorine? [15:38:37] <^demon> It's where all the mw logs are collected. [15:38:53] *what's /home/maxsem/exceptionmonitor [15:39:04] <^demon> It's a script that MaxSem wrote on fluorine. [15:39:21] <^demon> A script we never puppetized when Roan wrote the same basic thing ;-) [15:39:45] ^demon, it's now waiting in Gerrit [15:40:02] <^demon> There was a reason we didn't puppetize it before. [15:40:22] ? [15:40:49] <^demon> Something about "watching a log every 2 seconds isn't great" or somesuch. [15:40:52] <^demon> Roan would remember. [15:41:42] you can do it in a blocking way, instead of polling [15:42:03] heh, it's a tweak of fatalmonitor which is puppetized:P [15:42:17] <^demon> I thought fatalmonitor wasn't puppetized. [15:42:26] <^demon> Oh well, I don't terribly care that much :) [15:44:46] so how does that differ from current fatalmonitor? [15:45:55] !log Testing done, 40 concurrent processes hitting around the worst-case point kept the load on yttrium at 20%. Average response time ~430ms [15:46:03] Logged the message, Master [15:46:23] that's much better than I expected [15:51:33] Nikerabbit, fatalmonitor tracks warnings/fatals from apache logs, exceptionmonitor tracks exceptions [16:02:01] I see [16:02:36] MaxSem: would be nice to get someone to debug why it fails from time to time, it shouldn't [16:04:41] Nikerabbit, http://dpaste.org/mU3yO/ [16:05:26] ottomata1: yes [16:05:50] hiya, just a general q for you, since I was thinking about it twice this week [16:06:00] what do you think about upstart vs supervisor? [16:06:32] I don't know supervisor well but I think it fills another space [16:06:40] upstart seems more standard, but supervisor is easier for development [16:06:41] yeah [16:06:58] what do we need supervisor fro? [16:07:00] for [16:07:04] supervisor is more intrusive, upstart works like service scripts [16:07:08] well, i'm not sure we need it, [16:07:20] was just wondering what you thought, i was recommending to ori that he use upstart instead of supervisor [16:07:44] but supervisor can handle log files a bit better than upstart (I think), without having to manually pipe stderr/stdout around and set up log rotate [16:07:58] also, for a sec I thought ryan F was asking for something that needed it, but it turns out he's not [16:08:03] I really think we should do all that though [16:08:45] yeah, i think for production-y services, we should for sure, but what if people just want to test out their processes as daemons while developing? 
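
On the Upstart side of the exchange above, the log handling raised just above is the main thing done by hand. A bare-bones job file; every name and path below is a made-up placeholder, a sketch of the pattern rather than any actual WMF service:

    # /etc/init/mydaemon.conf
    description "example daemon"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    # the manual redirect is the part supervisord would do for you;
    # newer Upstart releases can use "console log" instead
    exec /usr/local/bin/mydaemon >> /var/log/mydaemon.log 2>&1

It is then driven by the normal service tooling (service mydaemon start, initctl status mydaemon), which is also what makes it easy to wrap in a Puppet service resource, versus supervisorctl needing its own provider as comes up just below.
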
[16:08:47] I'm not a big fan of supervisor and programs like it [16:08:55] ok cool [16:09:12] yeah, was just wondering, i don't need it at all now, since it turns out ryan's thing doesn't need it [16:09:44] but I don't have very strong feelings I should say [16:10:01] it just feels strange [16:11:44] MaxSem: there is something like 5s timeout there [16:12:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:59] yeah i know what you mean, using it is a little weird, cause you have to use supervisorctl, rather than service [16:19:04] which also makes it more difficult to puppetize [16:19:29] I was going to think about implementing a supervisor service provider if we had a use for it, but I don't think we do right now [16:25:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.566 seconds [16:28:32] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:38:37] New patchset: Faidon; "partman: add flavor for Ceph boxes with SSDs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39214 [16:38:38] New patchset: Faidon; "partman: get rid of yet another prompt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39215 [16:38:38] cmjohnson1: around? [16:38:47] apergos: too [16:39:02] yep [16:39:02] yes but leaving in about 5 minutes [16:39:06] what's up? [16:39:18] so the 720xds in eqiad don't have console redirection enabled [16:39:34] I'm not sure if that's the case in pmtpa as well (hence the ping to apergos) [16:40:10] New review: Faidon; "Painfully iterated and tested." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/39214 [16:40:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39214 [16:40:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39215 [16:41:34] paravoid: k...will check and fix...will probably have to reboot anything objections? [16:41:41] any objections [16:41:50] I did 1001 and 1002 [16:42:14] I can do more, if you have spare cycles we can probably split them up [16:42:18] if not, I'll do all of them [16:42:28] I;m not sure what it looks like when it doesn't have console redirection working [16:42:52] what's more important is to have it on new boxes, it took me quite a while to manage to convince Java webstart to run the iDRAC console [16:43:09] apergos: you don't see the BIOS [16:43:14] or anything before Linux [16:43:37] well all the boxes I have touch so far are set up properly then [16:43:57] without that I would be a dead duck as I always have to go into the bios for something or other on these boxes [16:44:56] paravoid: did you check any of the others first? 1001 and 1002 were setup first [16:45:18] haha [16:45:21] no, just those two [16:45:29] might be my lucky day... 
:) [16:45:40] i am on 3...try 1004 [16:46:26] if not I will take care of it [16:46:36] well, we have to go into BIOS anyway [16:46:40] for the power mgmt thing [16:47:19] good point..so i will fix as necessary [16:49:43] ok, heading out, talk to folks later [16:59:08] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:41] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:13:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.897 seconds [17:15:02] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:11] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:25:39] New patchset: Mark Bergsma; "Stop threads from falling asleep" [operations/software] (master) - https://gerrit.wikimedia.org/r/39219 [17:47:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:53] RECOVERY - Host ms-be1003 is UP: PING WARNING - Packet loss = 28%, RTA = 26.65 ms [18:03:11] PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: Connection refused by host [18:03:12] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: Connection refused by host [18:03:20] PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: Connection refused by host [18:03:21] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: Connection refused by host [18:03:21] PROBLEM - swift-container-server on ms-be1003 is CRITICAL: Connection refused by host [18:03:30] PROBLEM - swift-object-server on ms-be1003 is CRITICAL: Connection refused by host [18:03:30] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [18:03:30] PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: Connection refused by host [18:03:56] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: Connection refused by host [18:03:57] cmjohnson1: fwiw, ms-be1003 was also like that [18:04:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [18:04:06] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: Connection refused by host [18:04:23] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: Connection refused by host [18:04:25] yep..i see what happened...the default is now set to not redirect [18:04:33] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: Connection refused by host [18:04:33] PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: Connection refused by host [18:17:12] hashar: hey, are you about? [18:17:35] notpeter: kind of here though not really :-D [18:17:53] currently watching my wife trying to feed our daughter [18:17:55] cmjohnson1: you're doing ms-be1003 now, aren't you [18:18:05] quick question: I never ran apache-graceful-all yesterday. is it still needed? [18:18:16] cmjohnson1: I did it already, had the console open and I'm watching you do it [18:18:56] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:02] New patchset: Ottomata; "Adding $template_variables parameter to udp2log::instance." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39225 [18:19:26] notpeter: if the apache conf deployed in production are ok, I guess gracefulling will be fine [18:19:40] hashar: ok, cool. 
will do now [18:19:45] notpeter: srv193 did not complain any more about the apc.shm_size so I think it is fine [18:19:49] meeester notpeter, could you review this for me real quick? [18:19:53] https://gerrit.wikimedia.org/r/#/c/39225/ [18:19:56] should be a quick one [18:20:04] i could push it myself but it'd be better if someone else said ok [18:20:08] notpeter: at worth, the setting will be fixed whenever someone graceful the apaches for some reason [18:20:08] shouldn't affect anything [18:20:09] hashar: yeah, I'm going to test one by hand, and then graceful all [18:20:14] ottomata: sure! [18:20:15] paravoid: sorry...i realized once I booted into it you had done it [18:20:18] notpeter: sounds like a good idea :D [18:20:29] cmjohnson1: that's okay, my bad for not saying so :) [18:20:36] hashar: I'm wildly risk-averse :) [18:21:24] 1004,1008,1012 are fixed [18:21:27] ottomata: is that var/array/whatever supposed to be used anywhere yet? [18:21:39] no [18:21:45] ok! [18:22:29] notpeter: good to know :-] I am off again! [18:23:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39225 [18:24:11] danke, i'll sockpuppet it [18:24:14] ok [18:28:05] PROBLEM - Host analytics1021 is DOWN: CRITICAL - Network Unreachable (10.64.36.121) [18:28:05] PROBLEM - Host analytics1020 is DOWN: CRITICAL - Network Unreachable (10.64.36.120) [18:28:05] PROBLEM - Host analytics1012 is DOWN: CRITICAL - Network Unreachable (10.64.36.112) [18:28:05] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Network Unreachable (10.64.36.122) [18:28:05] PROBLEM - Host analytics1017 is DOWN: CRITICAL - Network Unreachable (10.64.36.117) [18:28:14] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Network Unreachable (10.64.36.111) [18:28:14] PROBLEM - Host analytics1016 is DOWN: CRITICAL - Network Unreachable (10.64.36.116) [18:28:14] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Network Unreachable (10.64.36.126) [18:28:14] PROBLEM - Host analytics1014 is DOWN: CRITICAL - Network Unreachable (10.64.36.114) [18:28:15] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Network Unreachable (10.64.36.127) [18:28:15] PROBLEM - Host analytics1025 is DOWN: CRITICAL - Network Unreachable (10.64.36.125) [18:28:15] PROBLEM - Host analytics1015 is DOWN: CRITICAL - Network Unreachable (10.64.36.115) [18:28:15] PROBLEM - Host analytics1013 is DOWN: CRITICAL - Network Unreachable (10.64.36.113) [18:28:16] PROBLEM - Host analytics1023 is DOWN: CRITICAL - Network Unreachable (10.64.36.123) [18:28:16] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Network Unreachable (10.64.36.119) [18:28:17] PROBLEM - Host analytics1018 is DOWN: CRITICAL - Network Unreachable (10.64.36.118) [18:28:17] PROBLEM - Host analytics1024 is DOWN: CRITICAL - Network Unreachable (10.64.36.124) [18:28:21] uh oh [18:28:23] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [18:28:23] PROBLEM - Host es1009 is DOWN: CRITICAL - Network Unreachable (10.64.32.19) [18:28:24] PROBLEM - Host es1010 is DOWN: CRITICAL - Network Unreachable (10.64.32.20) [18:28:30] LeslieCarr: ping [18:28:32] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [18:28:45] paravoid: oh [18:28:49] danke [18:29:05] New patchset: Anomie; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [18:29:50] New review: Anomie; "PS4: Mirror changes from MWRealm.php in MWRealm.sh" [operations/mediawiki-config] (master); V: 0 C: 0; - 
https://gerrit.wikimedia.org/r/39191 [18:30:30] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [18:30:38] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [18:30:38] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [18:30:38] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [18:30:38] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [18:31:24] working on this [18:31:28] ty [18:31:34] !log asw-c-eqiad ae bundles went down, working on fixing [18:31:43] Logged the message, Mistress of the network gear. [18:32:21] cmjohnson1: are you doing the rest or should I go ahead and do ms-be1004? [18:32:48] LeslieCarr: need help? [18:32:52] nearly finished 1010 and 1011 are last 2 [18:32:59] paravoid ^ [18:33:01] oh, cool [18:33:04] thanks a lot! [18:33:33] mark: turning the ports on and off of the non working ae bundle killed the working one :( i'm on a meeting with ATAC now [18:33:39] sorry took a little longer to get to than i thought [18:33:55] ok [18:34:50] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:34:50] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [18:34:51] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:34:51] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:34:59] RECOVERY - Host es1009 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [18:34:59] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [18:35:00] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms [18:35:00] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [18:35:09] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:35:09] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [18:35:09] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [18:35:09] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 26.74 ms [18:35:17] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:35:27] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:35:27] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [18:35:28] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 26.92 ms [18:35:35] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [18:35:36] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [18:35:36] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [18:35:36] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [18:36:02] RECOVERY - Host analytics1024 is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [18:36:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:54] LeslieCarr: does it help to convert one of the two bundles to non-bondle, just a single port? 
[18:37:24] New patchset: Dzahn; "add mflaschen and mholmquist keys and to mortals group per RT-4114/4115" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39228 [18:37:45] * marktraceur giggles with girlish glee [18:37:59] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:38:16] haha [18:39:53] mark: yeah, the bundles will come up then, or if they are in a static lag [18:39:54] :( [18:40:50] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:41:10] i've avoided LACP for a long time because of weird shit like this [18:41:15] especially inter-vendor [18:41:23] but I figured with just j, it should work :P [18:41:34] hahaha good one [18:41:52] i've seen it just work on junier and juniper <-> force10 for years and years [18:42:17] gave the guy the configs, he's reproducing in a lab [18:43:08] New review: MarkTraceur; "I feel so....mortal...." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/39228 [18:44:08] PROBLEM - swift-object-server on ms-be1010 is CRITICAL: Connection refused by host [18:44:17] PROBLEM - swift-account-replicator on ms-be1010 is CRITICAL: Connection refused by host [18:44:35] PROBLEM - swift-container-updater on ms-be1010 is CRITICAL: Connection refused by host [18:44:35] PROBLEM - swift-object-updater on ms-be1010 is CRITICAL: Connection refused by host [18:44:35] PROBLEM - SSH on ms-be1010 is CRITICAL: Connection refused [18:44:44] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [18:44:44] PROBLEM - swift-container-server on ms-be1010 is CRITICAL: Connection refused by host [18:44:53] PROBLEM - swift-account-server on ms-be1010 is CRITICAL: Connection refused by host [18:45:29] PROBLEM - swift-object-replicator on ms-be1010 is CRITICAL: Connection refused by host [18:45:29] PROBLEM - swift-container-replicator on ms-be1010 is CRITICAL: Connection refused by host [18:45:38] PROBLEM - swift-account-reaper on ms-be1010 is CRITICAL: Connection refused by host [18:45:38] PROBLEM - swift-object-auditor on ms-be1010 is CRITICAL: Connection refused by host [18:45:39] PROBLEM - swift-account-auditor on ms-be1010 is CRITICAL: Connection refused by host [18:45:39] PROBLEM - swift-container-auditor on ms-be1010 is CRITICAL: Connection refused by host [18:52:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [19:01:28] PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: No response from NTP server [19:04:46] PROBLEM - NTP on ms-be1010 is CRITICAL: NTP CRITICAL: No response from NTP server [19:15:31] What did we have before MariaDB? 5.1-fb was a patched version of MySQL? [19:17:34] i'm 95% certain we haven't switched to mariadb on all of our databases, only some [19:17:35] but yes [19:19:22] LeslieCarr: Sorry, yes, just a percentage at the moment. "yes" -> "patched version of MySQL" ? [19:19:37] both :) [19:20:32] we were running mysqlatfacebook [19:20:43] well, still are as Leslie said [19:21:33] (that's Facebook's variant of MySQL) [19:22:23] Hello, I am going to install a new table on test/test2/mediawiki, is there a standard procedure(scripts) to install it? 
Thanks [19:24:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:31] bsitu, mwscript sql.php --wiki=<...> path/to/file.sql [19:32:41] Thanks, MaxSem [19:39:55] <^demon> sql.php does things like add $wgDBTablePrefix and $wgDBOptions and stuff to your sql. [19:40:03] <^demon> (Otherwise you might end up with MyISAM tables ;-)) [19:40:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [19:42:51] robh: ping [19:42:55] ^demon, heh - I recently managed to end up with MyISAM even using sql.php. who could have thought there could be a typo in variable name?:) [19:43:30] <^demon> Well, when variables are hidden in comments, anything can happen :) [19:49:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:49:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:57:13] Ryan_Lane: In case it matters… I am WFH today and also skipping out around 3. [19:57:24] andrewbogott: that's cool [19:58:01] taking a ride on the scenic Vallejo ferry [20:00:56] ooo where are you goingin vallejo ? [20:02:10] New patchset: Bsitu; "Configuration change for Echo extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39232 [20:02:40] LeslieCarr: Just to the wharf and back. Meeting a guy there for a shady craigslist purchase [20:03:12] LeslieCarr: Are there things in Vallejo worth visiting? Besides the boat ride itself? [20:04:15] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39232 [20:05:22] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [20:13:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:07] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [20:26:26] New patchset: Aaron Schulz; "Set captcha backend for testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39234 [20:26:31] !log bsitu synchronized wmf-config/InitialiseSettings.php [20:26:39] Logged the message, Master [20:27:01] !log bsitu synchronized wmf-config/CommonSettings.php [20:27:09] Logged the message, Master [20:27:28] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39234 [20:27:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.808 seconds [20:40:19] New patchset: Bsitu; "Set email notification to true by default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39284 [20:43:49] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39284 [20:44:25] !log Zuul: applying "filters events by user email" to our Zuul deployment https://review.openstack.org/#/c/17609/ [20:44:33] Logged the message, Master [20:44:49] !log running puppet on gallium. [20:44:55] !log gracefulling all apaches to pick up https://gerrit.wikimedia.org/r/#/c/38521/ (tested good on srv193) [20:44:58] Logged the message, Master [20:45:07] Logged the message, notpeter [20:45:51] one day I will have to investigate why puppet takes soooo long :D [20:46:57] hashar: here's a clue: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=stafford.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [20:47:19] notpeter: that is the gracefulling ? 
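
To make MaxSem's sql.php answer above concrete: a new extension table is applied per-wiki with something like the following. The wiki names and the .sql path are placeholders, not the commands that were actually run:

    mwscript sql.php --wiki=testwiki extensions/Echo/echo.sql
    mwscript sql.php --wiki=test2wiki extensions/Echo/echo.sql

sql.php expands the tokens MediaWiki hides inside SQL comments, such as /*_*/ for the table prefix and /*$wgDBTableOptions*/ for the storage engine and charset, which is exactly the misspelled-variable/MyISAM trap ^demon and MaxSem joke about.
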
[20:47:30] py is doing a graceful restart of all apaches [20:47:35] notpeter: doh :/ sorry about that [20:47:42] !log py gracefulled all apaches [20:47:51] Logged the message, Master [20:48:00] hashar: no, that's why puppet is a PoS [20:48:06] it gets slammed for lon periods of time [20:48:25] ah sorry I thought you were showing me a huge spike on the app servers [20:48:37] nah [20:48:48] so, I got this System failed sanity check: VIP not configured on lo [20:48:51] from every apache [20:48:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39161 [20:48:53] !log bsitu synchronized wmf-config/CommonSettings.php 'Update Echo config file' [20:48:55] is this a known thing? [20:49:01] Logged the message, Master [20:53:26] New patchset: Ottomata; "Rsyncing event logs from vanadium into /a/eventlogging/archive" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39309 [20:53:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39309 [20:59:11] bsitu, are you done with your deployment? [20:59:23] MaxSem: not yet [20:59:50] bsitu, please ping me when you will be [21:01:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:01:41] MaxSem: I will, thx [21:04:26] New patchset: Bsitu; "Turns off Echo on test and test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39311 [21:04:45] New review: Dzahn; "mholmquist - verified via IRC cloak and Gerrit login" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/39228 [21:04:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39228 [21:04:56] Yeeees. [21:05:26] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39311 [21:05:59] marktraceur: now puppet needs to run .. i will update the ticket once it should work [21:08:18] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Turns off Echo temporarily on test and test2' [21:08:26] Logged the message, Master [21:08:35] AOK! [21:08:53] MaxSem: you can go ahead with your deployment [21:09:33] bsitu, thanks!:) [21:11:25] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [21:15:33] New review: Nemo bis; "I'm trying to make the diff more readable..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [21:18:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [21:18:52] marktraceur: try now:) [21:20:21] mutante: Win! Thanks! [21:20:57] mutante: Is it also set up on other boxen? I couldn't trivially ssh to tin, but maybe I have to do something special for ssh-agent? [21:22:19] marktraceur: well..not really, the request said "root@fenari:/home/mholmquist/.ssh# cat authorized_keys [21:22:22] # HEADER: This file was autogenerated at Tue Dec 18 21:17:33 +0000 2012 [21:22:26] # HEADER: by puppet. While it can still be managed manually, it [21:22:28] # HEADER: is definitely not recommended. [21:22:31] arg, wrong clipboard [21:22:45] marktraceur: it said "shell access as a deployer." 
that would just be fenari so far [21:22:54] Ohhh [21:23:06] mutante: I'm deploying for parsoid, so I need to use tin for a git-deploy base [21:23:41] marktraceur: ok..i understand..let me clarify on the request [21:23:51] *nod* no problem [21:25:10] robh: can you check my work before i commit [21:25:31] checkin [21:28:02] cmjohnson1: looks good [21:28:46] cool...thx robh [21:28:58] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [21:32:14] !log authdns-update new dns entries for frack bastion host (tellurium) [21:32:21] Logged the message, Master [21:36:55] is there a way to determine what's causing a Varnish to 503 with a guru mediatation? [21:37:08] marktraceur: i see your home dir on tin now ... [21:37:24] i thought first it would not be created. .but it was..just took a little longer [21:37:53] mutante: So it may be resolved anyway? [21:38:15] yes [21:38:26] mutante: Done! Thanks! [21:38:34] you just need to forward your key [21:38:43] well..or better.. use ProxyCommand in your ssh config [21:38:58] to get to tin without even having to forward it [21:39:00] I enabled agent forwarding [21:39:22] Which seems to do the job [21:40:59] marktraceur: that does the job..this is even better https://labsconsole.wikimedia.org/wiki/Help:Access#Using_ProxyCommand_ssh_option [21:41:13] like put tin in your .ssh/config once ... [21:41:27] and then just "ssh tin" from your localbox or something everytime in the future [21:41:39] I suppose so [21:41:48] you should get to tin in one step that way [21:41:56] I may also be deploying for VisualEditor, depending on the day. So both are useful. [21:45:05] New review: Dereckson; "Congratulations to for this work!" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [21:45:51] it's also more secure because your key does not even have to be handled by the bastion host [21:46:03] but yeah, you can use either [21:47:19] mutante: I'm not sure I understand how the proxycommand is translated for production servers, it's not obvious where I should put tin and fenari and so on [21:48:33] New review: Nemo bis; "Yes, I'll remove the trailing ws (I'm still triple-checking the diff)." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [21:49:33] Host tin [21:49:50] ProxyCommand ssh -a -W %h:%p mholmquist@fenari.wikimedia.org [21:49:58] User mholmquist [21:50:08] marktraceur: <-- .. and then just "ssh tin" should work [21:51:15] Sweet. Thanks so much mutante! [21:51:19] yw [21:51:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:04:39] heya, anybody know why i'm getting a bunch of RT email updates? [22:04:44] mutante? ^ [22:04:52] i can't find settings in RT to shush them [22:04:56] ottomata: because people are working on your tickets? 
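
Pulling mutante's config lines above into one place, the stanza lives in ~/.ssh/config on the local machine (the username is marktraceur's; the pattern generalizes to any host behind the bastion):

    Host tin
        User mholmquist
        ProxyCommand ssh -a -W %h:%p mholmquist@fenari.wikimedia.org

-W forwards the connection's stdin/stdout straight to tin through fenari, and -a disables agent forwarding on that hop, so a plain "ssh tin" works without the key or agent ever being handled by the bastion, which is the security point made above.
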
[22:05:03] naw, they are ones unrelated to me [22:05:03] is that bad?:) [22:05:08] this started happening a week ago [22:05:10] give me a ticket number [22:05:16] 4154 [22:05:22] also 4153 [22:05:26] 4151, [22:05:32] 4150, 4149, 4148 [22:05:42] more and more [22:06:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [22:07:03] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [22:07:40] ottomata: somebody added you to ops :) [22:07:48] !log aaron synchronized php-1.21wmf6/extensions/ConfirmEdit/captcha.py [22:07:52] well, there is a BCC: field [22:07:57] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:01] you are just getting what all of us get [22:08:01] Logged the message, Master [22:08:23] it's the default BCC: to keep people updated whats going on [22:08:32] i suggest you filter it in your client [22:08:44] hmmm, ok [22:09:04] look at one of the tickets. you see those "Outgoing mail recorded" messages, right [22:09:15] if you hit Show next to that you can see mail headers [22:09:22] and how you appear in BCC: list [22:09:36] i am not aware though who exactly added you to that when [22:10:28] hm, ok [22:10:48] any tips for differentiating non bcc ones in the client? [22:11:48] ottomata: ok, confirmed, you were in engineering before, now you are ops. welcome :) [22:11:54] ha, thanks :) [22:12:10] ottomata: you can filter by your own name being in To: [22:12:24] and make those "important" or let it add a flag [22:12:40] hmm, i guess To: or Cc: [22:12:45] or just filter them all and check the web ui? [22:13:11] you can build yourself a custom dashboard to perfection:) [22:17:15] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:21] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.485 second response time [22:20:33] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [22:22:03] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.645 second response time [22:24:27] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:53] csteipp: another one https://gerrit.wikimedia.org/r/#/c/39320/ [22:27:18] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:21] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [22:38:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:38:44] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [22:40:23] scapping... [22:41:51] New review: Nemo bis; "Should be ok now: I've added the comment where missing in the first ones, checked for wrong removal ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [22:42:09] New patchset: Aaron Schulz; "Set the captcha-render container directory." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39323 [22:42:57] !log deploying Andrew Otto's group changes to OpenStackManager to labsconsole [22:43:03] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39323 [22:43:05] Logged the message, Master [22:43:56] !log aaron synchronized wmf-config/filebackend.php 'set captcha directory.' [22:44:04] Logged the message, Master [22:45:16] New patchset: Aaron Schulz; "Switched captcha dir for test2wiki, reverted to old one for testwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39324 [22:45:43] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39324 [22:46:26] !log aaron synchronized wmf-config/CommonSettings.php 'switched new captcha setting from testwiki -> test2wiki' [22:46:34] Logged the message, Master [22:46:37] awjr: ^ [22:46:47] ooo [22:46:50] thanks AaronSchulz [22:47:01] AaronSchulz: looks like that worked [22:50:56] notpeter: can you make /mnt/upload7/private/captcha (and subdirs/files) owned by apache? [22:51:16] * AaronSchulz loves when stuff is randomly rooted over [22:52:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.974 seconds [22:54:12] !log maxsem Started syncing Wikimedia installation... : https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-18 [22:54:21] Logged the message, Master [22:57:18] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.969 second response time [22:59:35] !log temp putting ganglia.w.o behind htaccess for sec reasons [22:59:44] Logged the message, notpeter [22:59:50] AaronSchulz: sure [23:00:05] heh, I was 'bout to ping mutant :) [23:00:18] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.553 second response time [23:00:19] *mutante [23:00:24] heh [23:00:44] AaronSchulz: apache: good? [23:01:05] or a specific group? [23:01:26] AaronSchulz: [23:01:27] AaronSchulz: [23:01:28] AaronSchulz: [23:01:29] AaronSchulz: [23:01:31] heh [23:01:31] AaronSchulz: [23:01:37] let me check the others [23:01:50] ottomata: your group change has been deployed [23:01:55] I'm going to guess apache:wikidev [23:02:16] and is working well [23:02:21] notpeter: I was going to guess apache ;) [23:02:26] it seems to be all over the map [23:02:33] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:34] I'll just do apache: [23:02:37] I'm going to make an ldap backup and run the maintenance script :) [23:02:40] it won't matter for my purpose though [23:02:42] ok [23:03:09] heh, so I was wondering wtf captcha2 has, and it seems to have the bad word list [23:03:15] * AaronSchulz will merge it with his own [23:04:02] AaronSchulz: it's going... slowly [23:04:26] well there are ~150,000 files [23:04:36] not a huge amount but it will take a bit [23:04:49] or maybe less, at least that's what tim said ;) [23:05:15] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:42] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.516 second response time [23:06:34] still going [23:06:49] notpeter: where do we get our ganglia login? [23:06:57] /home/w/docs [23:06:58] sorry [23:07:01] np [23:07:04] was about to email [23:07:25] ah cool.
I was queried in the meantime, hadn't been paying attention [23:10:48] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:04] AaronSchulz: still going [23:11:26] done! [23:11:28] AaronSchulz: ^ [23:11:55] looks good [23:16:48] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.394 second response time [23:16:51] New patchset: Pyoungmeister; "temp putting ganglia behind htaccess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:20:05] notpeter: thinking about it, apache is definitely the best choice, since it's hard for people to mess up perms running shell scripts as themselves [23:20:31] * AaronSchulz should probably also enable the MW check for that [23:20:35] meh, someday [23:21:54] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:07] New patchset: Pyoungmeister; "temp putting ganglia behind htaccess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:24:14] AaronSchulz: ah, that is true [23:24:40] most of those people might be me [23:24:51] hehehehehe [23:25:40] New patchset: Pyoungmeister; "temp putting ganglia behind htaccess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:26:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:27:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:09] wooo thanks Ryan! [23:31:16] thank you ;) [23:31:44] so that means that we can grant file access by adding people to projects in labsconsole, right? [23:33:07] yes, but we'll also need to get groups working with the hadoop web stuff too, right? [23:40:28] !log maxsem Finished syncing Wikimedia installation... : https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-18 [23:40:36] Logged the message, Master [23:40:38] heh [23:41:42] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.310 second response time [23:42:02] well, i think it just uses ldap to do whatever it does, so i'm not sure [23:42:08] Ryan_Lane ^ [23:42:09] not sure [23:42:12] i will look more into it tomorrow [23:42:55] ok [23:43:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [23:46:08] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:01] can someone please flush Varnish? [23:49:25] * AaronSchulz hears a toilet sound in his head [23:56:02] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.971 second response time [23:57:34] New patchset: Ryan Lane; "Don't set the pki directory explicitly for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39330 [23:58:53] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.593 second response time
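Going back to the RT mail thread from 22:05-22:13: the advice is to keep receiving the ops-wide BCC copies but have the mail setup flag the tickets that actually name you in To: or Cc:. One way to express that idea is a Sieve rule, assuming a Sieve-capable mail server; the address and the RT subject tag below are placeholders to adjust for the local setup:

    require ["fileinto", "imap4flags"];

    # mail that names me directly: flag it and leave it in the inbox
    if address :contains ["to", "cc"] "me@example.org" {
        setflag "\\Flagged";
    }
    # remaining RT traffic (the BCC'd copies): file it away and rely on the web UI / dashboard
    elsif header :contains "subject" "[example.org #" {
        fileinto "RT";
    }

The same split can be done with any client-side filter, which is what mutante actually suggests; the dashboard remark at 22:13 is the web-UI half of that workflow.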
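The ownership fix AaronSchulz asks for at 22:50 (and notpeter runs until the "done!" at 23:11) comes down to a recursive chown of the captcha tree; something along these lines, with the group left off just as notpeter ends up doing:

    # run as root on a host that has /mnt/upload7 mounted read-write
    chown -R apache /mnt/upload7/private/captcha

The "still going" updates are just the recursion crawling the ~150,000 files mentioned at 23:04.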
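For the 23:49 "can someone please flush Varnish?" request, the manual route is a ban from the admin CLI rather than a restart. A rough sketch, again assuming Varnish 3.x and shell access on the cache host; the host and URL patterns are placeholders:

    # invalidate everything matching a host + path pattern
    varnishadm "ban req.http.host ~ example.org && req.url ~ ^/some/path"
    # or, as a last resort, mark every cached object invalid
    varnishadm "ban req.url ~ ."

Normally purges reach the caches through MediaWiki's regular purge mechanism; a manual ban like this is the fallback for one-off cases.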