[00:02:26] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [00:04:50] New review: Hashar; "Ooops :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39158 [00:05:07] hehe [00:05:09] both merged fwiw [00:05:20] works for me [00:05:23] thanks ! [00:06:41] I will still do my changes in puppet anyway. Ops reviews are priceless. [00:10:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:50] !log temp stopping lsave on es1009 and es1010 for upcoming networking downtime [00:11:58] Logged the message, notpeter [00:14:37] New patchset: Ori.livneh; "rsync eventlogging logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39161 [00:15:20] PROBLEM - MySQL Replication Heartbeat on es1009 is CRITICAL: CRIT replication delay 231 seconds [00:16:41] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on es1009 is CRITICAL: CRIT replication delay 231 seconds peter for network downtime [00:16:41] PROBLEM - MySQL Replication Heartbeat on es1010 is CRITICAL: CRIT replication delay 268 seconds [00:16:53] thank you hashar :) [00:17:11] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on es1010 is CRITICAL: CRIT replication delay 268 seconds peter for network downtime [00:17:27] LeslieCarr: not sure why but you are welcome :D [00:17:28] mutante: I'm ack'ing things just for you :) [00:17:56] notpeter: thanks:) [00:18:01] for the ops reviews are priceless [00:18:14] notpeter: some day i want that to be a bot feature [00:18:16] kind of stating the obvious :-D [00:20:08] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: HTTP CRITICAL - No data received from host [00:25:32] hashar: i wanted to give you full access, but already done :) [00:25:38] hashar doesn't need fine-grained sudo on gallium anymore, as he's root. [00:25:49] paravoid added you [00:26:11] mutante: sorry :-]  we have been talking about it like half an hour ago [00:26:25] !log starting upgrade of asw-c-eqiad.mgmt - connectivity to row c machines may be affected [00:26:29] i see you wrote the change even ..nice! [00:26:34] Logged the message, Mistress of the network gear. [00:26:45] mutante: ops taught me to use puppet so I do! [00:27:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [00:27:22] hashar: awesome:) [00:28:02] hashar: resolved.. cat /etc/sudoers.d/hashar [00:29:53] PROBLEM - Host analytics1018 is DOWN: CRITICAL - Network Unreachable (10.64.36.118) [00:29:53] PROBLEM - Host analytics1020 is DOWN: CRITICAL - Network Unreachable (10.64.36.120) [00:29:54] PROBLEM - Host analytics1025 is DOWN: CRITICAL - Network Unreachable (10.64.36.125) [00:29:54] PROBLEM - Host analytics1024 is DOWN: CRITICAL - Network Unreachable (10.64.36.124) [00:29:54] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Network Unreachable (10.64.36.127) [00:30:02] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [00:30:02] PROBLEM - Host es1009 is DOWN: CRITICAL - Network Unreachable (10.64.32.19) [00:30:03] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [00:30:29] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Network Unreachable (10.64.36.122) [00:30:29] PROBLEM - Host es1010 is DOWN: CRITICAL - Network Unreachable (10.64.32.20) [00:30:38] PROBLEM - Host analytics1017 is DOWN: CRITICAL - Network Unreachable (10.64.36.117) [00:30:38] nagios-wm: are they in the same rack? 
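
The ACKNOWLEDGEMENT entries above, and the wish voiced above to make acking a bot feature, come down to writing one line into the Nagios/Icinga external command pipe. A minimal sketch, reusing the host/service/author/comment from the alerts above and assuming a stock command-file path (the real path depends on how the monitoring host is set up):

    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    # (needs write access to the monitoring daemon's command pipe)
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;es1009;MySQL Replication Heartbeat;1;1;1;peter;for network downtime\n' \
        "$(date +%s)" > /var/lib/icinga/rw/icinga.cmd

A bot feature would just be a thin IRC front-end around the same write.
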
[00:30:39] PROBLEM - Host analytics1016 is DOWN: CRITICAL - Network Unreachable (10.64.36.116) [00:30:39] PROBLEM - Host analytics1021 is DOWN: CRITICAL - Network Unreachable (10.64.36.121) [00:30:56] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Network Unreachable (10.64.36.111) [00:31:05] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Network Unreachable (10.64.36.119) [00:31:05] PROBLEM - Host analytics1013 is DOWN: CRITICAL - Network Unreachable (10.64.36.113) [00:31:14] PROBLEM - Host analytics1015 is DOWN: CRITICAL - Network Unreachable (10.64.36.115) [00:31:14] PROBLEM - Host analytics1023 is DOWN: CRITICAL - Network Unreachable (10.64.36.123) [00:31:23] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Network Unreachable (10.64.36.126) [00:31:24] PROBLEM - Host analytics1014 is DOWN: CRITICAL - Network Unreachable (10.64.36.114) [00:31:24] PROBLEM - Host analytics1012 is DOWN: CRITICAL - Network Unreachable (10.64.36.112) [00:31:45] ^ dschoon / drdee / ottomata? [00:31:56] that looks lovely. [00:32:05] that's the asw-c-eqiad reboot [00:32:07] oh, probably leslie's upgrade. [00:32:08] that's scheduled [00:32:10] right. sorry. [00:32:12] yep [00:32:14] that's me [00:32:17] mwhahaha [00:32:18] nice. [00:32:24] i didn't like those machines anyway. [00:32:28] oh noes, i forgot about es1007 and es1008 [00:32:31] they're always making me do stuff. [00:32:33] sorry notpeter :-/ [00:32:40] !log fixing fenari permissions for gwicke.. (pre-puppet age UID) [00:32:48] Logged the message, Master [00:32:53] you know, i always thought this job would be easier if we just disabled all the external access [00:32:57] users always cause scaling issues [00:33:02] data you need to analyze [00:33:20] use domas optimizer: // [00:35:24] night time, have a good afternoon [00:35:33] gwicke: 2006, February 13 :) [00:36:26] * Susan smiles at Domas optimizer. [00:36:35] mutante: that was the last change to my permissions? [00:36:40] yeah:) [00:36:47] k ;) [00:37:04] puppet could not update the file, it did not have permissions:) [00:37:11] fixed and running again [00:37:32] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [00:37:32] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [00:37:41] gwicke: welcome back! should work now [00:37:51] gabriel@tosh is in place [00:38:14] Is there a host named "tosh"? [00:38:24] mutante: awesome, thanks! [00:38:32] Susan: yes, my laptop [00:38:34] Susan: somewhere under gwickes control, yea [00:38:40] bonus for guessing the brand ;) [00:38:44] Heh. [00:38:52] gwicke: ooh..i thought Reggae music [00:38:57] Ah, "tosh" is an English Wiktionary in-joke. [00:39:11] hehe [00:39:23] https://en.wiktionary.org/wiki/tosh and related deletion log summaries. [00:39:31] Susan: wiktionary user! yay! [00:39:45] I'm just familiar with its culture. ;-) [00:40:12] my original home wiki i guess [00:40:30] "Stupidity" still being in block drop-down menu is... classic English Wiktionary. [00:40:41] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [00:40:50] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [00:41:43] well this is good…. the upgrade/reboot to fix everything made all lacp stay down [00:42:17] it was only supposed to be a 3 hour tour! [00:57:09] !log asw-c-eqiad unreachable due to lacp issue [00:57:17] Logged the message, Mistress of the network gear. 
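
The ae bundles that stayed down on asw-c-eqiad after the upgrade are ordinary Junos aggregated-Ethernet (LACP) interfaces. A rough sketch of the checks and configuration involved; the interface names here are placeholders, not the real switch config:

    show lacp interfaces ae1          # are the member links Collecting/Distributing?
    show interfaces ae1 extensive     # bundle and member-link state

    # configuration side (EX-series style, names assumed)
    set interfaces xe-2/0/0 ether-options 802.3ad ae1
    set interfaces ae1 aggregated-ether-options lacp active
    # the "static LAG" fallback discussed later in the log is the same bundle
    # with the lacp stanza removed:
    delete interfaces ae1 aggregated-ether-options lacp
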
[01:00:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:32] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:38] New review: Dzahn; "stop..breakage ..UID 618 is duplicate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35924 [01:14:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 231 seconds [01:15:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [01:15:21] New patchset: Dzahn; "fix UID for sbernarding to 623, duplicate UID usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39172 [01:15:22] New patchset: Dzahn; "fix UID for maryana to 624, duplicate UID usage (609/csteipp)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39173 [01:15:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39172 [01:16:04] New review: Dzahn; "https://gerrit.wikimedia.org/r/39172" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35924 [01:18:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:18:48] New review: Dzahn; "this used a duplicate UID.. same with csteipp ..was 609. fixed in https://gerrit.wikimedia.org/r/#/c..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24008 [01:19:07] LeslieCarr: please install fix_lacp.exe [01:21:28] New review: Dzahn; "https://rt.wikimedia.org/Ticket/Display.html?id=3517" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24008 [01:28:09] !log fixing duplicate UID issue on stat1 for maryana [01:28:19] Logged the message, Master [01:28:34] New review: Dzahn; "fix RT-3517" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/39173 [01:28:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39173 [01:29:49] New review: Tychay; "By the power of Greyskull, I lack the powerz to +2." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/32362 [01:36:29] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:36:46] err: /Stage[main]/Misc::Statistics::Mediawiki/Git::Clone[statistics_mediawiki]/Exec[git_pull_statistics_mediawiki]/returns: change from notrun to 0 failed: git pull --quiet returned 1 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:679 [01:48:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.806 seconds [02:05:26] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [02:11:08] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:08] RECOVERY - Host es1009 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:09] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [02:11:09] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:17] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [02:11:17] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [02:11:18] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [02:11:26] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [02:11:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:35] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [02:11:35] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [02:11:36] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 26.97 ms [02:11:36] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [02:11:36] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [02:11:36] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:11:44] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:11:44] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [02:11:44] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 27.28 ms [02:11:45] RECOVERY - Host analytics1024 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:12:20] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:12:29] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [02:24:02] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [02:25:22] !log LocalisationUpdate completed (1.21wmf6) at Tue Dec 18 02:25:22 UTC 2012 [02:25:32] Logged the message, Master [02:29:53] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:23] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.066 second response time [02:37:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:45:58] !log LocalisationUpdate completed (1.21wmf5) at Tue Dec 18 02:45:58 UTC 2012 [02:46:08] Logged the message, Master [03:24:02] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP 
CRITICAL - No data received from host [04:13:59] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:21:56] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [04:51:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [04:57:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:58:59] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:06:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.364 seconds [05:41:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:51:54] RECOVERY - MySQL Replication Heartbeat on es1009 is OK: OK replication delay 0 seconds [05:52:21] RECOVERY - MySQL Replication Heartbeat on es1010 is OK: OK replication delay 0 seconds [05:56:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.918 seconds [06:24:07] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [06:29:54] !log rsynced all labs homedirs to gluster volumes [06:30:05] Logged the message, Master [06:30:30] !log switched all labs instances to mount /home via gluster on next reboot [06:30:38] Logged the message, Master [06:31:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:51] @instance-info nova-precise1 [06:38:57] ugh [06:44:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.007 seconds [07:13:45] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [07:18:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [07:34:36] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.006 second response time on port 11000 [08:06:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:19:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.492 seconds [08:26:13] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:40] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [08:29:40] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [08:29:40] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [08:31:55] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [08:35:00] New review: J; "> Why is this contained to the production realm? We should have it in betalabs too." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/38307 [08:37:03] hey j^ :) [08:55:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:04:46] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds [09:05:40] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 211 seconds [09:10:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [09:13:58] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:20] Change merged: Nikerabbit; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38910 [09:23:20] Danny_B|backup: ^ [09:25:13] !log nikerabbit synchronized wmf-config/CommonSettings.php 'Bug 43075' [09:25:22] Logged the message, Master [09:25:31] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:25:31] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:04] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:04] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:29] hello [09:40:23] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:42:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:47:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:55:54] apergos: around? [09:56:00] yes [09:56:40] paravoid: [09:57:28] I'm surprised that you said that we're looking at the end of January [09:57:31] is it that bad? 
[09:57:34] yes [09:57:48] it takes 3.5 days at best to complete an objct replication run [09:58:10] I'm taking them out and putting them in at 33/66/100 [09:58:27] right now we're behind because ms-be1 is playing catchup after its outage too [09:58:31] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:58:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [09:59:06] I'm not basing this off the etas, but off of actual run times [09:59:25] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [10:03:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:18:10] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [10:23:52] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:32:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:47:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [10:49:29] New patchset: Hashar; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [10:49:44] !g 39060 [10:49:44] https://gerrit.wikimedia.org/r/#q,39060,n,z [10:50:06] New review: Hashar; "rebased on top of https://gerrit.wikimedia.org/r/#/c/39060/" [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/32167 [10:50:09] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:51] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [10:59:45] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [11:16:06] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:19:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:35:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [11:42:54] New patchset: Hashar; "find filename based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [11:43:16] New review: Hashar; "I have extracted the MWRealm files in https://gerrit.wikimedia.org/r/39191 . They need to be improve..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/32167 [12:06:38] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [12:08:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:20] New patchset: Hashar; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [12:17:26] New review: Hashar; "PS2:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39191 [12:24:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [12:25:14] New patchset: Hashar; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [12:25:42] New review: Hashar; "PS3 makes it so the realm takes precedence over datacenter file when both choices are availables." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39191 [12:26:37] out for lunch [12:26:40] brb :) [12:33:38] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [12:35:08] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:50] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [12:44:08] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [12:57:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.550 seconds [13:33:41] New patchset: Demon; "Whitelist some more mimetypes for Gerrit to trust" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39196 [13:45:29] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:45:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:50] !log restarted puppet on gallium (some apt-get process was a zombie) [13:50:58] Logged the message, Master [13:51:11] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms [13:56:17] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:00:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [14:06:00] !g Ie2dd947c5345cdbcb0d7585f16928dbb7a980d2b [14:06:00] https://gerrit.wikimedia.org/r/#q,Ie2dd947c5345cdbcb0d7585f16928dbb7a980d2b,n,z [14:08:28] New patchset: Hashar; "sort hashes when expanding templates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39201 [14:13:48] <^demon> !log restarting gerrit on manganese to pick up VRIF+2 [14:13:56] Logged the message, Master [14:14:29] ^demon: worked :) [14:14:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:14:54] !log restarting Zuul with https://gerrit.wikimedia.org/r/39082 so it starts voting Verified+2 [14:15:02] Logged the message, Master [14:15:39] done [14:16:02] <^demon> Ok, gerrit restarted. All-Projects and mediawiki/* updated with new acls. [14:16:10] <^demon> Doing the acls on the 8 misc. extensions now [14:16:28] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39201 [14:17:05] ^demon: apparently it does +2 now https://gerrit.wikimedia.org/r/#/c/39201/ [14:17:08] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [14:18:08] <^demon> Extensions updated. [14:22:41] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [14:23:55] hmm [14:24:08] I just thought of something, adding a new event in Gerrit to simply retrigger tests [14:24:28] event would be sent whenever someone push the "Recheck" button on the patchset :-] [14:26:34] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39196 [14:30:20] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:30:52] I found an easy change to CR+2 ;) [14:32:57] <^demon> Can we have it skip doing CR=0 so it doesn't say "No score" on every change? 
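
Back to the swift capacity discussion earlier in the log: apergos is re-adding the ms-be backends by stepping their ring weights (33, then 66, then 100) and letting a full object-replication pass, about 3.5 days, converge between steps. With swift-ring-builder that staging looks roughly like the following; the builder file and device spec are placeholders, not the production ring:

    swift-ring-builder object.builder set_weight z1-10.0.6.200:6000/sdb1 33
    swift-ring-builder object.builder rebalance
    # push the new ring files out, wait for replication to settle,
    # then repeat with 66 and finally 100

Repeating that staging across the remaining backends is what stretches the work out toward the "end of January" mentioned above.
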
[14:33:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:29] ^demon: CR+2 merged the change with V+2 : https://gerrit.wikimedia.org/r/#/c/39027/ [14:35:47] <^demon> Yay, it worked! [14:36:02] ^demon: CR=0 is used to reset the CR-2 flag jenkins set whenever unit test fails [14:36:38] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [14:37:14] ^demon: do you have an example ? [14:38:48] <^demon> In the change you just gave, 39027, it assigned a CR=0 when it gave VRIF+2. [14:39:16] <^demon> Probably won't be a big deal. [14:41:27] ^demon: I guess the deployment works fine [14:42:14] <^demon> Glad it went so smoothly. I think people will like having the CRVW category for humans only again. [14:42:23] yup [14:42:35] I did the CR+1 to make it obvious to people that Jenkins linted the change [14:42:44] ended up not being a good idea after all [14:47:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.414 seconds [14:51:03] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.012 second response time on port 11000 [14:52:41] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [14:57:02] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:30] <^demon> hashar: I updated wikitech-l that we're done and everything seems good. [14:58:49] \O/ [14:59:44] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:02:44] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [15:03:27] Nikerabbit, exception log is full of "Exception from line 356 of /usr/local/apache/common-local/php-1.21wmf6/includes/cache/MessageCache.php: MessageCache failed to load messages" [15:03:35] is it something to worry about? [15:06:20] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [15:08:30] New patchset: Cmjohnson; "Adding solr1001-1003 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39206 [15:08:39] whee [15:09:57] robh: wanna check my spelling and spacing plz ^^ [15:12:18] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39206 [15:12:52] cmjohnson1: done, not merged on sockpuppet, thats all you [15:13:26] haaaai paravoid, you around? [15:16:39] New patchset: Cmjohnson; "Adding to solr servers to netboot using lvm.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39208 [15:17:51] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39208 [15:19:53] !log Load-testing spatial search [15:20:02] Logged the message, Master [15:21:23] Before switching off yttrium, I gonna have some data for perf comparison:) [15:23:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:10] MaxSem: yes and no [15:30:31] MaxSem: it has failed before causing nasty side effects like sidebar reverting to default value or no gadgets [15:30:58] MaxSem: if it is happening very often it should be looked into [15:31:23] Nikerabbit, exception log currently consists mostly of this exception [15:31:52] as in, many times per second or few every minute? 
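
The exceptionmonitor / fatalmonitor scripts that come up just below are essentially a periodic tail-and-tally over the aggregated MediaWiki logs on fluorine. A minimal sketch of the idea; the log path, interval and grep pattern are assumptions, not the actual scripts:

    watch -n 5 "tail -n 1000 /a/mw-log/exception.log \
        | grep -o 'Exception from line [0-9]* of [^:]*' \
        | sort | uniq -c | sort -rn | head -20"

The "watching a log every 2 seconds" objection below is about that polling interval; the blocking alternative mentioned there would follow the log with tail -F instead of re-reading it on a timer.
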
[15:33:38] as in, 56 of this and only 2 of any other exception in the last 1000 lines [15:35:29] try /home/maxsem/exceptionmonitor on fluorine [15:36:26] what's that [15:36:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.197 seconds [15:38:05] fluorine? [15:38:37] <^demon> It's where all the mw logs are collected. [15:38:53] *what's /home/maxsem/exceptionmonitor [15:39:04] <^demon> It's a script that MaxSem wrote on fluorine. [15:39:21] <^demon> A script we never puppetized when Roan wrote the same basic thing ;-) [15:39:45] ^demon, it's now waiting in Gerrit [15:40:02] <^demon> There was a reason we didn't puppetize it before. [15:40:22] ? [15:40:49] <^demon> Something about "watching a log every 2 seconds isn't great" or somesuch. [15:40:52] <^demon> Roan would remember. [15:41:42] you can do it in a blocking way, instead of polling [15:42:03] heh, it's a tweak of fatalmonitor which is puppetized:P [15:42:17] <^demon> I thought fatalmonitor wasn't puppetized. [15:42:26] <^demon> Oh well, I don't terribly care that much :) [15:44:46] so how does that differ from current fatalmonitor? [15:45:55] !log Testing done, 40 concurrent processes hitting around the worst-case point kept the load on yttrium at 20%. Average response time ~430ms [15:46:03] Logged the message, Master [15:46:23] that's much better than I expected [15:51:33] Nikerabbit, fatalmonitor tracks warnings/fatals from apache logs, exceptionmonitor tracks exceptions [16:02:01] I see [16:02:36] MaxSem: would be nice to get someone to debug why it fails from time to time, it shouldn't [16:04:41] Nikerabbit, http://dpaste.org/mU3yO/ [16:05:26] ottomata1: yes [16:05:50] hiya, just a general q for you, since I was thinking about it twice this week [16:06:00] what do you think about upstart vs supervisor? [16:06:32] I don't know supervisor well but I think it fills another space [16:06:40] upstart seems more standard, but supervisor is easier for development [16:06:41] yeah [16:06:58] what do we need supervisor fro? [16:07:00] for [16:07:04] supervisor is more intrusive, upstart works like service scripts [16:07:08] well, i'm not sure we need it, [16:07:20] was just wondering what you thought, i was recommending to ori that he use upstart instead of supervisor [16:07:44] but supervisor can handle log files a bit better than upstart (I think), without having to manually pipe stderr/stdout around and set up log rotate [16:07:58] also, for a sec I thought ryan F was asking for something that needed it, but it turns out he's not [16:08:03] I really think we should do all that though [16:08:45] yeah, i think for production-y services, we should for sure, but what if people just want to test out their processes as daemons while developing? 
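
On the Upstart side of the exchange above, the log handling raised just above is the main thing done by hand. A bare-bones job file; every name and path below is a made-up placeholder, a sketch of the pattern rather than any actual WMF service:

    # /etc/init/mydaemon.conf
    description "example daemon"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    # the manual redirect is the part supervisord would do for you;
    # newer Upstart releases can use "console log" instead
    exec /usr/local/bin/mydaemon >> /var/log/mydaemon.log 2>&1

It is then driven by the normal service tooling (service mydaemon start, initctl status mydaemon), which is also what makes it easy to wrap in a Puppet service resource, versus supervisorctl needing its own provider as comes up just below.
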
[16:08:47] I'm not a big fan of supervisor and programs like it [16:08:55] ok cool [16:09:12] yeah, was just wondering, i don't need it at all now, since it turns out ryan's thing doesn't need it [16:09:44] but I don't have very strong feelings I should say [16:10:01] it just feels strange [16:11:44] MaxSem: there is something like 5s timeout there [16:12:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:59] yeah i know what you mean, using it is a little weird, cause you have to use supervisorctl, rather than service [16:19:04] which also makes it more difficult to puppetize [16:19:29] I was going to think about implementing a supervisor service provider if we had a use for it, but I don't think we do right now [16:25:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.566 seconds [16:28:32] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:38:37] New patchset: Faidon; "partman: add flavor for Ceph boxes with SSDs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39214 [16:38:38] New patchset: Faidon; "partman: get rid of yet another prompt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39215 [16:38:38] cmjohnson1: around? [16:38:47] apergos: too [16:39:02] yep [16:39:02] yes but leaving in about 5 minutes [16:39:06] what's up? [16:39:18] so the 720xds in eqiad don't have console redirection enabled [16:39:34] I'm not sure if that's the case in pmtpa as well (hence the ping to apergos) [16:40:10] New review: Faidon; "Painfully iterated and tested." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/39214 [16:40:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39214 [16:40:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39215 [16:41:34] paravoid: k...will check and fix...will probably have to reboot anything objections? [16:41:41] any objections [16:41:50] I did 1001 and 1002 [16:42:14] I can do more, if you have spare cycles we can probably split them up [16:42:18] if not, I'll do all of them [16:42:28] I;m not sure what it looks like when it doesn't have console redirection working [16:42:52] what's more important is to have it on new boxes, it took me quite a while to manage to convince Java webstart to run the iDRAC console [16:43:09] apergos: you don't see the BIOS [16:43:14] or anything before Linux [16:43:37] well all the boxes I have touch so far are set up properly then [16:43:57] without that I would be a dead duck as I always have to go into the bios for something or other on these boxes [16:44:56] paravoid: did you check any of the others first? 1001 and 1002 were setup first [16:45:18] haha [16:45:21] no, just those two [16:45:29] might be my lucky day... 
:) [16:45:40] i am on 3...try 1004 [16:46:26] if not I will take care of it [16:46:36] well, we have to go into BIOS anyway [16:46:40] for the power mgmt thing [16:47:19] good point..so i will fix as necessary [16:49:43] ok, heading out, talk to folks later [16:59:08] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:41] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:13:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.897 seconds [17:15:02] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:11] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:25:39] New patchset: Mark Bergsma; "Stop threads from falling asleep" [operations/software] (master) - https://gerrit.wikimedia.org/r/39219 [17:47:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:53] RECOVERY - Host ms-be1003 is UP: PING WARNING - Packet loss = 28%, RTA = 26.65 ms [18:03:11] PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: Connection refused by host [18:03:12] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: Connection refused by host [18:03:20] PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: Connection refused by host [18:03:21] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: Connection refused by host [18:03:21] PROBLEM - swift-container-server on ms-be1003 is CRITICAL: Connection refused by host [18:03:30] PROBLEM - swift-object-server on ms-be1003 is CRITICAL: Connection refused by host [18:03:30] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [18:03:30] PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: Connection refused by host [18:03:56] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: Connection refused by host [18:03:57] cmjohnson1: fwiw, ms-be1003 was also like that [18:04:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [18:04:06] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: Connection refused by host [18:04:23] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: Connection refused by host [18:04:25] yep..i see what happened...the default is now set to not redirect [18:04:33] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: Connection refused by host [18:04:33] PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: Connection refused by host [18:17:12] hashar: hey, are you about? [18:17:35] notpeter: kind of here though not really :-D [18:17:53] currently watching my wife trying to feed our daughter [18:17:55] cmjohnson1: you're doing ms-be1003 now, aren't you [18:18:05] quick question: I never ran apache-graceful-all yesterday. is it still needed? [18:18:16] cmjohnson1: I did it already, had the console open and I'm watching you do it [18:18:56] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:02] New patchset: Ottomata; "Adding $template_variables parameter to udp2log::instance." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39225 [18:19:26] notpeter: if the apache conf deployed in production are ok, I guess gracefulling will be fine [18:19:40] hashar: ok, cool. 
will do now [18:19:45] notpeter: srv193 did not complain any more about the apc.shm_size so I think it is fine [18:19:49] meeester notpeter, could you review this for me real quick? [18:19:53] https://gerrit.wikimedia.org/r/#/c/39225/ [18:19:56] should be a quick one [18:20:04] i could push it myself but it'd be better if someone else said ok [18:20:08] notpeter: at worth, the setting will be fixed whenever someone graceful the apaches for some reason [18:20:08] shouldn't affect anything [18:20:09] hashar: yeah, I'm going to test one by hand, and then graceful all [18:20:14] ottomata: sure! [18:20:15] paravoid: sorry...i realized once I booted into it you had done it [18:20:18] notpeter: sounds like a good idea :D [18:20:29] cmjohnson1: that's okay, my bad for not saying so :) [18:20:36] hashar: I'm wildly risk-averse :) [18:21:24] 1004,1008,1012 are fixed [18:21:27] ottomata: is that var/array/whatever supposed to be used anywhere yet? [18:21:39] no [18:21:45] ok! [18:22:29] notpeter: good to know :-] I am off again! [18:23:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39225 [18:24:11] danke, i'll sockpuppet it [18:24:14] ok [18:28:05] PROBLEM - Host analytics1021 is DOWN: CRITICAL - Network Unreachable (10.64.36.121) [18:28:05] PROBLEM - Host analytics1020 is DOWN: CRITICAL - Network Unreachable (10.64.36.120) [18:28:05] PROBLEM - Host analytics1012 is DOWN: CRITICAL - Network Unreachable (10.64.36.112) [18:28:05] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Network Unreachable (10.64.36.122) [18:28:05] PROBLEM - Host analytics1017 is DOWN: CRITICAL - Network Unreachable (10.64.36.117) [18:28:14] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Network Unreachable (10.64.36.111) [18:28:14] PROBLEM - Host analytics1016 is DOWN: CRITICAL - Network Unreachable (10.64.36.116) [18:28:14] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Network Unreachable (10.64.36.126) [18:28:14] PROBLEM - Host analytics1014 is DOWN: CRITICAL - Network Unreachable (10.64.36.114) [18:28:15] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Network Unreachable (10.64.36.127) [18:28:15] PROBLEM - Host analytics1025 is DOWN: CRITICAL - Network Unreachable (10.64.36.125) [18:28:15] PROBLEM - Host analytics1015 is DOWN: CRITICAL - Network Unreachable (10.64.36.115) [18:28:15] PROBLEM - Host analytics1013 is DOWN: CRITICAL - Network Unreachable (10.64.36.113) [18:28:16] PROBLEM - Host analytics1023 is DOWN: CRITICAL - Network Unreachable (10.64.36.123) [18:28:16] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Network Unreachable (10.64.36.119) [18:28:17] PROBLEM - Host analytics1018 is DOWN: CRITICAL - Network Unreachable (10.64.36.118) [18:28:17] PROBLEM - Host analytics1024 is DOWN: CRITICAL - Network Unreachable (10.64.36.124) [18:28:21] uh oh [18:28:23] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [18:28:23] PROBLEM - Host es1009 is DOWN: CRITICAL - Network Unreachable (10.64.32.19) [18:28:24] PROBLEM - Host es1010 is DOWN: CRITICAL - Network Unreachable (10.64.32.20) [18:28:30] LeslieCarr: ping [18:28:32] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [18:28:45] paravoid: oh [18:28:49] danke [18:29:05] New patchset: Anomie; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [18:29:50] New review: Anomie; "PS4: Mirror changes from MWRealm.php in MWRealm.sh" [operations/mediawiki-config] (master); V: 0 C: 0; - 
https://gerrit.wikimedia.org/r/39191 [18:30:30] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [18:30:38] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [18:30:38] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [18:30:38] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [18:30:38] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [18:31:24] working on this [18:31:28] ty [18:31:34] !log asw-c-eqiad ae bundles went down, working on fixing [18:31:43] Logged the message, Mistress of the network gear. [18:32:21] cmjohnson1: are you doing the rest or should I go ahead and do ms-be1004? [18:32:48] LeslieCarr: need help? [18:32:52] nearly finished 1010 and 1011 are last 2 [18:32:59] paravoid ^ [18:33:01] oh, cool [18:33:04] thanks a lot! [18:33:33] mark: turning the ports on and off of the non working ae bundle killed the working one :( i'm on a meeting with ATAC now [18:33:39] sorry took a little longer to get to than i thought [18:33:55] ok [18:34:50] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:34:50] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [18:34:51] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:34:51] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:34:59] RECOVERY - Host es1009 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [18:34:59] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [18:35:00] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms [18:35:00] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [18:35:09] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:35:09] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [18:35:09] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [18:35:09] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 26.74 ms [18:35:17] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:35:27] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:35:27] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [18:35:28] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 26.92 ms [18:35:35] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [18:35:36] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [18:35:36] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [18:35:36] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [18:36:02] RECOVERY - Host analytics1024 is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [18:36:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:54] LeslieCarr: does it help to convert one of the two bundles to non-bondle, just a single port? 
[18:37:24] New patchset: Dzahn; "add mflaschen and mholmquist keys and to mortals group per RT-4114/4115" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39228 [18:37:45] * marktraceur giggles with girlish glee [18:37:59] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:38:16] haha [18:39:53] mark: yeah, the bundles will come up then, or if they are in a static lag [18:39:54] :( [18:40:50] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [18:41:10] i've avoided LACP for a long time because of weird shit like this [18:41:15] especially inter-vendor [18:41:23] but I figured with just j, it should work :P [18:41:34] hahaha good one [18:41:52] i've seen it just work on junier and juniper <-> force10 for years and years [18:42:17] gave the guy the configs, he's reproducing in a lab [18:43:08] New review: MarkTraceur; "I feel so....mortal...." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/39228 [18:44:08] PROBLEM - swift-object-server on ms-be1010 is CRITICAL: Connection refused by host [18:44:17] PROBLEM - swift-account-replicator on ms-be1010 is CRITICAL: Connection refused by host [18:44:35] PROBLEM - swift-container-updater on ms-be1010 is CRITICAL: Connection refused by host [18:44:35] PROBLEM - swift-object-updater on ms-be1010 is CRITICAL: Connection refused by host [18:44:35] PROBLEM - SSH on ms-be1010 is CRITICAL: Connection refused [18:44:44] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [18:44:44] PROBLEM - swift-container-server on ms-be1010 is CRITICAL: Connection refused by host [18:44:53] PROBLEM - swift-account-server on ms-be1010 is CRITICAL: Connection refused by host [18:45:29] PROBLEM - swift-object-replicator on ms-be1010 is CRITICAL: Connection refused by host [18:45:29] PROBLEM - swift-container-replicator on ms-be1010 is CRITICAL: Connection refused by host [18:45:38] PROBLEM - swift-account-reaper on ms-be1010 is CRITICAL: Connection refused by host [18:45:38] PROBLEM - swift-object-auditor on ms-be1010 is CRITICAL: Connection refused by host [18:45:39] PROBLEM - swift-account-auditor on ms-be1010 is CRITICAL: Connection refused by host [18:45:39] PROBLEM - swift-container-auditor on ms-be1010 is CRITICAL: Connection refused by host [18:52:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [19:01:28] PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: No response from NTP server [19:04:46] PROBLEM - NTP on ms-be1010 is CRITICAL: NTP CRITICAL: No response from NTP server [19:15:31] What did we have before MariaDB? 5.1-fb was a patched version of MySQL? [19:17:34] i'm 95% certain we haven't switched to mariadb on all of our databases, only some [19:17:35] but yes [19:19:22] LeslieCarr: Sorry, yes, just a percentage at the moment. "yes" -> "patched version of MySQL" ? [19:19:37] both :) [19:20:32] we were running mysqlatfacebook [19:20:43] well, still are as Leslie said [19:21:33] (that's Facebook's variant of MySQL) [19:22:23] Hello, I am going to install a new table on test/test2/mediawiki, is there a standard procedure(scripts) to install it? 
Thanks [19:24:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:31] bsitu, mwscript sql.php --wiki=<...> path/to/file.sql [19:32:41] Thanks, MaxSem [19:39:55] <^demon> sql.php does things like add $wgDBTablePrefix and $wgDBOptions and stuff to your sql. [19:40:03] <^demon> (Otherwise you might end up with MyISAM tables ;-)) [19:40:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [19:42:51] robh: ping [19:42:55] ^demon, heh - I recently managed to end up with MyISAM even using sql.php. who could have thought there could be a typo in variable name?:) [19:43:30] <^demon> Well, when variables are hidden in comments, anything can happen :) [19:49:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:49:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:57:13] Ryan_Lane: In case it matters… I am WFH today and also skipping out around 3. [19:57:24] andrewbogott: that's cool [19:58:01] taking a ride on the scenic Vallejo ferry [20:00:56] ooo where are you goingin vallejo ? [20:02:10] New patchset: Bsitu; "Configuration change for Echo extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39232 [20:02:40] LeslieCarr: Just to the wharf and back. Meeting a guy there for a shady craigslist purchase [20:03:12] LeslieCarr: Are there things in Vallejo worth visiting? Besides the boat ride itself? [20:04:15] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39232 [20:05:22] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [20:13:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:07] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [20:26:26] New patchset: Aaron Schulz; "Set captcha backend for testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39234 [20:26:31] !log bsitu synchronized wmf-config/InitialiseSettings.php [20:26:39] Logged the message, Master [20:27:01] !log bsitu synchronized wmf-config/CommonSettings.php [20:27:09] Logged the message, Master [20:27:28] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39234 [20:27:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.808 seconds [20:40:19] New patchset: Bsitu; "Set email notification to true by default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39284 [20:43:49] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39284 [20:44:25] !log Zuul: applying "filters events by user email" to our Zuul deployment https://review.openstack.org/#/c/17609/ [20:44:33] Logged the message, Master [20:44:49] !log running puppet on gallium. [20:44:55] !log gracefulling all apaches to pick up https://gerrit.wikimedia.org/r/#/c/38521/ (tested good on srv193) [20:44:58] Logged the message, Master [20:45:07] Logged the message, notpeter [20:45:51] one day I will have to investigate why puppet takes soooo long :D [20:46:57] hashar: here's a clue: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=stafford.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [20:47:19] notpeter: that is the gracefulling ? 
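
To make MaxSem's sql.php answer above concrete: a new extension table is applied per-wiki with something like the following. The wiki names and the .sql path are placeholders, not the commands that were actually run:

    mwscript sql.php --wiki=testwiki extensions/Echo/echo.sql
    mwscript sql.php --wiki=test2wiki extensions/Echo/echo.sql

sql.php expands the tokens MediaWiki hides inside SQL comments, such as /*_*/ for the table prefix and /*$wgDBTableOptions*/ for the storage engine and charset, which is exactly the misspelled-variable/MyISAM trap ^demon and MaxSem joke about.
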
[20:47:30] py is doing a graceful restart of all apaches [20:47:35] notpeter: doh :/ sorry about that [20:47:42] !log py gracefulled all apaches [20:47:51] Logged the message, Master [20:48:00] hashar: no, that's why puppet is a PoS [20:48:06] it gets slammed for lon periods of time [20:48:25] ah sorry I thought you were showing me a huge spike on the app servers [20:48:37] nah [20:48:48] so, I got this System failed sanity check: VIP not configured on lo [20:48:51] from every apache [20:48:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39161 [20:48:53] !log bsitu synchronized wmf-config/CommonSettings.php 'Update Echo config file' [20:48:55] is this a known thing? [20:49:01] Logged the message, Master [20:53:26] New patchset: Ottomata; "Rsyncing event logs from vanadium into /a/eventlogging/archive" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39309 [20:53:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39309 [20:59:11] bsitu, are you done with your deployment? [20:59:23] MaxSem: not yet [20:59:50] bsitu, please ping me when you will be [21:01:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:01:41] MaxSem: I will, thx [21:04:26] New patchset: Bsitu; "Turns off Echo on test and test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39311 [21:04:45] New review: Dzahn; "mholmquist - verified via IRC cloak and Gerrit login" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/39228 [21:04:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39228 [21:04:56] Yeeees. [21:05:26] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39311 [21:05:59] marktraceur: now puppet needs to run .. i will update the ticket once it should work [21:08:18] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Turns off Echo temporarily on test and test2' [21:08:26] Logged the message, Master [21:08:35] AOK! [21:08:53] MaxSem: you can go ahead with your deployment [21:09:33] bsitu, thanks!:) [21:11:25] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [21:15:33] New review: Nemo bis; "I'm trying to make the diff more readable..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [21:18:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [21:18:52] marktraceur: try now:) [21:20:21] mutante: Win! Thanks! [21:20:57] mutante: Is it also set up on other boxen? I couldn't trivially ssh to tin, but maybe I have to do something special for ssh-agent? [21:22:19] marktraceur: well..not really, the request said "root@fenari:/home/mholmquist/.ssh# cat authorized_keys [21:22:22] # HEADER: This file was autogenerated at Tue Dec 18 21:17:33 +0000 2012 [21:22:26] # HEADER: by puppet. While it can still be managed manually, it [21:22:28] # HEADER: is definitely not recommended. [21:22:31] arg, wrong clipboard [21:22:45] marktraceur: it said "shell access as a deployer." 
that would just be fenari so far [21:22:54] Ohhh [21:23:06] mutante: I'm deploying for parsoid, so I need to use tin for a git-deploy base [21:23:41] marktraceur: ok..i understand..let me clarify on the request [21:23:51] *nod* no problem [21:25:10] robh: can you check my work before i commit [21:25:31] checkin [21:28:02] cmjohnson1: looks good [21:28:46] cool...thx robh [21:28:58] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [21:32:14] !log authdns-update new dns entries for frack bastion host (tellurium) [21:32:21] Logged the message, Master [21:36:55] is there a way to determine what's causing a Varnish to 503 with a guru mediatation? [21:37:08] marktraceur: i see your home dir on tin now ... [21:37:24] i thought first it would not be created. .but it was..just took a little longer [21:37:53] mutante: So it may be resolved anyway? [21:38:15] yes [21:38:26] mutante: Done! Thanks! [21:38:34] you just need to forward your key [21:38:43] well..or better.. use ProxyCommand in your ssh config [21:38:58] to get to tin without even having to forward it [21:39:00] I enabled agent forwarding [21:39:22] Which seems to do the job [21:40:59] marktraceur: that does the job..this is even better https://labsconsole.wikimedia.org/wiki/Help:Access#Using_ProxyCommand_ssh_option [21:41:13] like put tin in your .ssh/config once ... [21:41:27] and then just "ssh tin" from your localbox or something everytime in the future [21:41:39] I suppose so [21:41:48] you should get to tin in one step that way [21:41:56] I may also be deploying for VisualEditor, depending on the day. So both are useful. [21:45:05] New review: Dereckson; "Congratulations to for this work!" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [21:45:51] it's also more secure because your key does not even have to be handled by the bastion host [21:46:03] but yeah, you can use either [21:47:19] mutante: I'm not sure I understand how the proxycommand is translated for production servers, it's not obvious where I should put tin and fenari and so on [21:48:33] New review: Nemo bis; "Yes, I'll remove the trailing ws (I'm still triple-checking the diff)." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [21:49:33] Host tin [21:49:50] ProxyCommand ssh -a -W %h:%p mholmquist@fenari.wikimedia.org [21:49:58] User mholmquist [21:50:08] marktraceur: <-- .. and then just "ssh tin" should work [21:51:15] Sweet. Thanks so much mutante! [21:51:19] yw [21:51:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:04:39] heya, anybody know why i'm getting a bunch of RT email updates? [22:04:44] mutante? ^ [22:04:52] i can't find settings in RT to shush them [22:04:56] ottomata: because people are working on your tickets? 
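
Pulling mutante's config lines above into one place, the stanza lives in ~/.ssh/config on the local machine (the username is marktraceur's; the pattern generalizes to any host behind the bastion):

    Host tin
        User mholmquist
        ProxyCommand ssh -a -W %h:%p mholmquist@fenari.wikimedia.org

-W forwards the connection's stdin/stdout straight to tin through fenari, and -a disables agent forwarding on that hop, so a plain "ssh tin" works without the key or agent ever being handled by the bastion, which is the security point made above.
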
[22:05:03] naw, they are ones unrelated to me [22:05:03] is that bad?:) [22:05:08] this started happening a week ago [22:05:10] give me a ticket number [22:05:16] 4154 [22:05:22] also 4153 [22:05:26] 4151, [22:05:32] 4150, 4149, 4148 [22:05:42] more and more [22:06:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [22:07:03] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [22:07:40] ottomata: somebody added you to ops :) [22:07:48] !log aaron synchronized php-1.21wmf6/extensions/ConfirmEdit/captcha.py [22:07:52] well, there is a BCC: field [22:07:57] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:01] you are just getting what all of us get [22:08:01] Logged the message, Master [22:08:23] it's the default BCC: to keep people updated whats going on [22:08:32] i suggest you filter it in your client [22:08:44] hmmm, ok [22:09:04] look at one of the tickets. you see those "Outgoing mail recorded" messages, right [22:09:15] if you hit Show next to that you can see mail headers [22:09:22] and how you appear in BCC: list [22:09:36] i am not aware though who exactly added you to that when [22:10:28] hm, ok [22:10:48] any tips for differentiating non bcc ones in the client? [22:11:48] ottomata: ok, confirmed, you were in engineering before, now you are ops. welcome :) [22:11:54] ha, thanks :) [22:12:10] ottomata: you can filter by your own name being in To: [22:12:24] and make those "important" or let it add a flag [22:12:40] hmm, i guess To: or Cc: [22:12:45] or just filter them all and check the web ui? [22:13:11] you can build yourself a custom dashboard to perfection:) [22:17:15] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:21] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.485 second response time [22:20:33] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [22:22:03] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.645 second response time [22:24:27] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:53] csteipp: another one https://gerrit.wikimedia.org/r/#/c/39320/ [22:27:18] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:21] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [22:38:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:38:44] New patchset: Nemo bis; "(bug 43240) Add localised logos for Wiktionaries without one" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39312 [22:40:23] scapping... [22:41:51] New review: Nemo bis; "Should be ok now: I've added the comment where missing in the first ones, checked for wrong removal ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/39312 [22:42:09] New patchset: Aaron Schulz; "Set the captcha-render container directory." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39323 [22:42:57] !log deploying Andrew Otto's group changes to OpenStackManager to labsconsole [22:43:03] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39323 [22:43:05] Logged the message, Master [22:43:56] !log aaron synchronized wmf-config/filebackend.php 'set captcha directory.' [22:44:04] Logged the message, Master [22:45:16] New patchset: Aaron Schulz; "Switched captcha dir for test2wiki, reverted to old one for testwiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39324 [22:45:43] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39324 [22:46:26] !log aaron synchronized wmf-config/CommonSettings.php 'switched new captcha setting from testwiki -> test2wiki' [22:46:34] Logged the message, Master [22:46:37] awjr: ^ [22:46:47] ooo [22:46:50] thanks AaronSchulz [22:47:01] AaronSchulz: looks like that worked [22:50:56] notpeter: can you make /mnt/upload7/private/captcha (and subdirs/files) owned by apache? [22:51:16] * AaronSchulz loves when stuff is randomly rooted over [22:52:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.974 seconds [22:54:12] !log maxsem Started syncing Wikimedia installation... : https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-18 [22:54:21] Logged the message, Master [22:57:18] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.969 second response time [22:59:35] !log temp putting ganglia.w.o behind htaccess for sec reasons [22:59:44] Logged the message, notpeter [22:59:50] AaronSchulz: sure [23:00:05] heh, I was 'bout to ping mutant :) [23:00:18] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.553 second response time [23:00:19] *mutante [23:00:24] heh [23:00:44] AaronSchulz: apache: good? [23:01:05] or a specific group? [23:01:26] AaronSchulz: [23:01:27] AaronSchulz: [23:01:28] AaronSchulz: [23:01:29] AaronSchulz: [23:01:31] heh [23:01:31] AaronSchulz: [23:01:37] let me check the others [23:01:50] ottomata: your group change has been deployed [23:01:55] I'm going to guess apache:wikidev [23:02:16] and is working well [23:02:21] notpeter: I was going to guess apache ;) [23:02:26] it seems to be all over the map [23:02:33] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:34] I'll just do apache: [23:02:37] I'm going to make an ldap backup and run the maintenance script :) [23:02:40] it won't matter for my purpose though [23:02:42] ok [23:03:09] heh, so I was wondering wtf captcha2 has, and it seems to have the bad word list [23:03:15] * AaronSchulz will merge it with his own [23:04:02] AaronSchulz: it's going... slowly [23:04:26] well there are ~150,000 files [23:04:36] not a huge amount but it will take a bit [23:04:49] or maybe less, at least that's what tim said ;) [23:05:15] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:42] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.516 second response time [23:06:34] still going [23:06:49] notpeter: where do we get our ganglia login? [23:06:57] /home/w/docs [23:06:58] sorry [23:07:01] np [23:07:04] was about to email [23:07:25] ah cool.
I was queried in the meantime, hadn't been paying attention [23:10:48] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:04] AaronSchulz: still going [23:11:26] done! [23:11:28] AaronSchulz: ^ [23:11:55] looks good [23:16:48] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.394 second response time [23:16:51] New patchset: Pyoungmeister; "temp putting ganglia behind htaccess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:20:05] notpeter: thinking about it, apache is definitely the best choice, since it's hard for people to mess up perms running shell scripts as themselves [23:20:31] * AaronSchulz should probably also enable the MW check for that [23:20:35] meh, someday [23:21:54] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:07] New patchset: Pyoungmeister; "temp putting ganglia behind htaccess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:24:14] AaronSchulz: ah, that is true [23:24:40] most of those people might be me [23:24:51] hehehehehe [23:25:40] New patchset: Pyoungmeister; "temp putting ganglia behind htaccess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:26:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39328 [23:27:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:09] wooo thanks Ryan! [23:31:16] thank you ;) [23:31:44] so that means that we can grant file access by adding people to projects in labsconsole, right? [23:33:07] yes, but we'll also need to get groups working with the hadoop web stuff too, right? [23:40:28] !log maxsem Finished syncing Wikimedia installation... : https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2012-12-18 [23:40:36] Logged the message, Master [23:40:38] heh [23:41:42] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.310 second response time [23:42:02] well, i think it just uses ldap to do whatever it does, so i'm not sure [23:42:08] Ryan_Lane ^ [23:42:09] not sure [23:42:12] i will look more into it tomorrow [23:42:55] ok [23:43:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [23:46:08] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:01] can someone please flush Varnish? [23:49:25] * AaronSchulz hears a toilet sound in his head [23:56:02] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.971 second response time [23:57:34] New patchset: Ryan Lane; "Don't set the pki directory explicitly for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39330 [23:58:53] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.593 second response time
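Going back to the RT mail thread from 22:05-22:13: the advice is to keep receiving the ops-wide BCC copies but have the mail setup flag the tickets that actually name you in To: or Cc:. One way to express that idea is a Sieve rule, assuming a Sieve-capable mail server; the address and the RT subject tag below are placeholders to adjust for the local setup:

    require ["fileinto", "imap4flags"];

    # mail that names me directly: flag it and leave it in the inbox
    if address :contains ["to", "cc"] "me@example.org" {
        setflag "\\Flagged";
    }
    # remaining RT traffic (the BCC'd copies): file it away and rely on the web UI / dashboard
    elsif header :contains "subject" "[example.org #" {
        fileinto "RT";
    }

The same split can be done with any client-side filter, which is what mutante actually suggests; the dashboard remark at 22:13 is the web-UI half of that workflow.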
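The ownership fix AaronSchulz asks for at 22:50 (and notpeter runs until the "done!" at 23:11) comes down to a recursive chown of the captcha tree; something along these lines, with the group left off just as notpeter ends up doing:

    # run as root on a host that has /mnt/upload7 mounted read-write
    chown -R apache /mnt/upload7/private/captcha

The "still going" updates are just the recursion crawling the ~150,000 files mentioned at 23:04.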
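For the 23:49 "can someone please flush Varnish?" request, the manual route is a ban from the admin CLI rather than a restart. A rough sketch, again assuming Varnish 3.x and shell access on the cache host; the host and URL patterns are placeholders:

    # invalidate everything matching a host + path pattern
    varnishadm "ban req.http.host ~ example.org && req.url ~ ^/some/path"
    # or, as a last resort, mark every cached object invalid
    varnishadm "ban req.url ~ ."

Normally purges reach the caches through MediaWiki's regular purge mechanism; a manual ban like this is the fallback for one-off cases.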