[00:00:11] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [00:00:11] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 17 seconds [00:00:38] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 15 seconds [00:00:38] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 1 seconds [00:00:38] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 1 seconds [00:00:56] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [00:01:59] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [00:02:17] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [00:02:17] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 18 seconds [00:03:02] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 25 seconds [00:03:47] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 13 seconds [00:04:23] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [00:05:44] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 4 seconds [00:07:41] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [00:07:50] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [00:09:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [00:15:02] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.005 second response time [00:15:56] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.011 second response time [00:24:56] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:32] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:51] New patchset: Tim Starling; "Reduce db32 read load to zero due to persistent lag" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13853 [00:35:18] New review: Tim Starling; "Already live." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13853 [00:35:20] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13853 [00:37:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:38:35] PROBLEM - Host mw1093 is DOWN: PING CRITICAL - Packet loss = 100% [00:46:06] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [00:46:23] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.026 second response time [00:47:17] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.012 second response time [00:47:26] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [00:50:17] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [00:50:53] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:51:38] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [00:51:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [00:51:56] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [01:02:36] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [01:05:26] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [01:06:56] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [01:12:38] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [01:18:56] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.012 second response time [01:19:14] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.014 second response time [01:23:35] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [01:32:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:41] PROBLEM - Host mw1095 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [01:41:35] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [01:42:47] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 258 seconds [01:42:56] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [01:46:05] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [01:47:26] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [01:48:38] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 607s [01:51:29] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:52:59] RECOVERY - Misc_Db_Lag on 
storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 34s [01:53:17] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 6 seconds [01:56:12] Hey wikimedia people. I'm with the company (your.org) that's been working with apergos on setting up the full off-site mirror of everything (http://ftpmirror.your.org and http://dumps.wikimedia.your.org). Today we got an automated notice from Google that both of those URLs were serving malware. Their lack of specificity is making this really difficult to narrow down, so before we tear everything apart, does anyone know if th [01:56:12] images/commons files are being scanned for anything nasty? Or is there any way that you guys could have flagged something as deleted but it's still being pushed to us? [01:58:51] I'm running clamav on the entire set of everything being mirrored. The only thing it's complained about so far is http://ftpmirror.your.org/pub/wikimedia/images/wiktionary/fj/c/c4/citibank-car-loan.pdf but nothing else is triggering on that file [02:07:05] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:08:26] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [02:10:50] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:11:36] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:11] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.013 second response time [02:21:56] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.010 second response time [02:28:32] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [02:28:32] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [02:30:38] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [02:31:32] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [02:32:35] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [02:32:35] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [02:36:38] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [02:36:38] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [02:37:32] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [02:38:44] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [02:40:14] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:40:32] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has 
not run in the last 10 hours [02:42:38] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [02:43:32] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [02:44:35] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [02:44:35] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [02:45:38] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [02:46:32] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [02:47:35] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [02:47:36] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [02:49:32] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [02:51:38] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [02:51:38] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [02:51:38] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [02:52:33] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [02:53:35] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [02:54:56] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 [02:57:38] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [02:57:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:00:35] Malware in Your.org o_O [03:07:06] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Mon Jul 2 03:06:52 UTC 2012 [03:15:38] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Jul 2 03:15:28 UTC 2012 [03:15:56] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [03:17:17] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [03:17:47] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:26:47] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:28:08] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:36:05] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:36:23] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:56] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.008 second response time [03:39:14] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.007 second response time [03:53:47] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:56:47] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [03:58:08] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 
208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [04:53:15] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 6, down: 2, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [04:56:15] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [05:09:20] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [05:13:23] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [05:56:08] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:29:25] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [06:30:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:22:26] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 185 seconds [07:22:26] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 185 seconds [07:30:51] hello [07:31:50] yo [07:31:53] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [07:33:14] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [07:33:53] apergos: the labs went wild this weekend due to the leap sec bug [07:34:00] I heard [07:34:03] I was mostly not around [07:34:05] I had some funny bugs like mysql spiking to 100% cpu [07:34:12] same with automount [07:34:15] I had that on my laptop, restarted and it was fine [07:34:27] prod was fine from what I heard, besides java search I think [07:34:28] (mysql). also some java thing, oh yeah it was tomcat [07:48:41] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [07:51:32] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:00:24] New patchset: Hashar; "(bug 37457) viwikibooks: fix import sources" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13860 [08:01:06] back in a little bit, errands before it gets blisteringly hot [08:01:17] New review: Hashar; "Follow up : https://gerrit.wikimedia.org/r/13860" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [08:01:39] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13860 [08:02:19] ohh Jenkins has troubles [08:02:21] :-D [08:02:37] yeah ksoftIRQ madness [08:02:52] apergos: I guess we need to reboot the gallium host :-D [08:03:40] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [08:05:28] is it "normal" for a lightly loaded server to do like 800k / 1M context switches per second? (according to vmstat) [08:06:31] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [08:06:33] that's on the high side [08:06:34] :) [08:07:10] that's a lot, but it may be normal [08:07:22] whenever a thread blocks, it wants to run another one [08:07:43] now, the question is what overhead do context switches have on that server?
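For reference, the numbers being asked about here can be pulled with stock tools, and the per-process view shows who is doing the switching (the ksoftirqd symptom mentioned above); a minimal sketch, with the sampling interval chosen arbitrarily:

    # system-wide: the "cs" column is context switches per second
    vmstat 1 5
    # per-process (sysstat package): voluntary (cswch/s) vs involuntary (nvcswch/s) switches
    pidstat -w 1 5
    # one-shot look at where CPU time is going
    top -b -n 1 | head -20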
[08:07:59] I have no idea how to check that [08:08:26] user / system are at 30% / 30% each, top -i just show ksoftirqd [08:08:47] platonides: thats quite a lot of thrashing, I'm used to busy DBs under 200k ;-) [08:09:58] hashar, don't worry, I wasn't expecting you to magically give out a number [08:10:18] the box does nothing :-] [08:10:49] all the cpu usage is overhead? [08:10:57] I guess so [08:10:57] that doesn't look too efficient :) [08:11:20] apergos: would you mind rebooting gallium for me please ? https://rt.wikimedia.org/Ticket/Display.html?id=3208 [08:11:47] !log Stopped Jenkins on gallium. It is not doing anything anyway. Asked to reboot box {{rt|3208}} [08:11:58] Logged the message, Master [08:16:27] New review: Hashar; "deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13860 [08:20:39] ah puppet is sooo smart :-] [08:20:44] it restarted jenkins [08:20:45] \O/ [08:26:19] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [08:26:19] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [08:39:22] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:04:49] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [09:06:10] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [09:33:40] hashar, do you still need that reboot? [09:35:30] apergos: gallium ? yes :-) [09:35:45] restarting the java app is not enough apparently [09:35:46] :( [09:36:42] I restarted Jenkins on gallium but CPU is still high http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&c=Miscellaneous+eqiad&h=gallium.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [09:36:53] so definitely ksoftirq eating everything [09:37:50] ok [09:38:51] !Log rebooting gallium, it's pretty unhappy (maybe related to leap second issue) [09:38:54] grrr [09:39:04] !log rebooting gallium, it's pretty unhappy (maybe related to leap second issue) [09:39:15] Logged the message, Master [09:39:38] It's a long time after the leap second change but it is running java [09:40:31] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:41:38] let's see how that goes [09:42:19] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [09:43:40] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [10:10:34] apergos: works for me thanks for rebooting gallium :-] [10:10:43] sure [10:10:45] should have asked you to upgrade packages while you were at it but that will be for the next time :-] [10:31:03] PROBLEM - NTP on db12 is CRITICAL: NTP CRITICAL: Offset -1.042744517 secs [10:36:00] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [10:37:21] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [10:44:33] RECOVERY - NTP on db12 is OK: NTP OK: Offset 0.004649877548 secs [10:53:38] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [10:55:44] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours [10:57:29] !log Problems on one of two pmtpa-eqiad waves; raised OSPF metric to 60 to failover traffic to the other link [10:57:40] Logged the message, Master [10:57:42] PROBLEM - Puppet 
freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [11:03:23] PROBLEM - Host mw1025 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:13:44] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [11:14:20] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [11:20:20] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: Device does not support ifTable - try without -I option [11:24:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [11:24:50] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:27:50] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:29:11] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [11:29:20] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [11:29:47] PROBLEM - Puppet freshness on tarin is CRITICAL: Puppet has not run in the last 10 hours [11:31:31] !log Now we have packet loss within pmtpa/sdtpa... reverting change [11:31:41] Logged the message, Master [11:31:53] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [11:34:55] the replag issues are known ? [11:36:48] TimStarling: Tim, can you do a little spam control on scribunto.wmflabs or give others who know how to do that some rights ? [11:41:03] there's some weird network issue going on [11:57:42] djhartman: I gave you bureacrat access [12:21:19] New patchset: Hashar; "Add a symbolic link to CREDITS for Change Ia02c3bcf." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [12:21:52] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [12:24:00] New review: Hashar; "deployed live" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [12:26:34] !log upgrading kernel on gallium [12:26:44] Logged the message, Master [12:27:25] PROBLEM - Host mw1044 is DOWN: PING CRITICAL - Packet loss = 100% [12:27:48] !log rebooting gallium one more time to install kernel [12:27:58] Logged the message, Master [12:29:13] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [12:29:13] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [12:31:10] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [12:32:13] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [12:33:16] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [12:33:16] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [12:35:20] !log installing upgrades on fenari (linux-firmware linux-libc-dev..) 
[12:35:31] Logged the message, Master [12:36:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:36:16] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [12:36:16] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [12:36:16] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [12:37:10] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [12:37:10] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [12:38:13] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [12:39:16] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [12:39:16] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [12:39:16] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [12:41:13] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [12:43:10] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [12:44:13] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [12:44:49] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [12:45:16] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [12:45:16] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [12:45:39] pfff [12:45:52] mutante: seems db12 has some replag [12:46:06] I have no idea where to check it though [12:46:10] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [12:46:10] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:47:13] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [12:48:16] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [12:48:16] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [12:48:18] @replag db12 [12:48:18] Damianz: [db12: s1] db12: 1252s [12:49:54] well [12:50:13] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [12:50:38] apergos, how did you diagnose it last time? [12:50:49] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [12:51:54] how did I diagnose which? 
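As an aside to the @replag exchange above, the same lag can be read straight off the slave itself; a minimal sketch, assuming direct MySQL access to the host named in the discussion:

    mysql -h db12 -e 'SHOW SLAVE STATUS\G' \
        | egrep 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'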
[12:51:54] ahhh http://noc.wikimedia.org/dbtree/ [12:51:55] ;) [12:52:10] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:52:10] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [12:52:10] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [12:52:10] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [12:52:14] someone else noticed it (and yes, that's an easy way to monitor them all) [12:52:27] the only thing I saw going on over there was the pupulatesha1 script [12:52:34] *oopulate [12:52:49] yup [12:52:55] it obviously doesn't pay attention to slave lag too much [12:53:13] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [12:53:44] what are the mysqldump running on db12 ? https://ishmael.wikimedia.org/more.php?host=db12&hours=24&checksum=7467891370387641567 [12:54:11] apergos, i mean when you blocked some ip requesting contribs en masse [12:55:08] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [12:56:14] hashar: DELETE /* LinksUpdate::incrTableUpdate Traveler100 */ FROM `templatelinks` ? [12:56:32] some jobrunner [12:56:33] sampled 1000 logs iirc [12:56:59] domas: the populateSha1 does wfWaitForSlave [12:57:14] well it isn't [12:57:28] should wait after each batch [12:57:39] how large is a batch? [12:58:07] that is the question ;) maybe it was made --batch=1 [12:58:13] obviously not enough [12:58:16] --batch = 100000000 [12:58:17] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [12:58:36] :-/ [12:59:01] bah ou [12:59:24] ahh populateRevisionSha1.php [12:59:28] running on hume [12:59:36] I was looking at another populate script [12:59:47] (that is on hume) [13:01:54] grgarougaema ze [13:02:02] they are running in screen sessions [13:02:05] with no log files [13:02:44] shall i tell VP/T that you guys are investigating the replag issue and point folks to server admin log ? [13:03:06] apergos: mutante the scripts have been running since June 29th. Not sure if they are the cause [13:03:55] they were already stopped once to let a slave catch up [13:03:59] see the server admin log [13:04:04] ohh [13:05:04] yeah, about 12 hours ago [13:06:45] so there is 4 occurrences of the script [13:06:49] each doing batches of 1000 [13:06:54] entris [13:07:13] so that is like 4k UPDATE [13:07:23] then wfWaitForSlaves [13:07:33] it's not that much data [13:07:35] * hashar kill -STOP aaron [13:09:30] hashar, he's in crontabs;) [13:11:35] is everywhere else done bar enwiki? [13:11:47] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 [13:12:15] Reedy: no idea sorry [13:12:48] needs faker :) [13:12:49] isn't that mysqldump overloading it --> https://ishmael.wikimedia.org/?host=db12 ? [13:13:05] that is what I was wondering [13:13:08] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [13:13:40] 2012-07-02 12:46:47 average time: 1938s mysqldump [13:13:48] it is not overloaded, btw [13:13:58] hehe [13:14:08] it's just being special and slow [13:14:15] it would need to have about 10x load to become overloaded for me :) [13:17:06] well, lag = not handling replication. 
sort of overloaded:) [13:19:17] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:21:19] hashar / reedy: you happen to know if this runs somewhere in cron already or is still needed in a cron? "maintenance/purgeParserCache.php" [13:21:30] well db12 is full of waiting for slave queries anyway [13:22:00] mutante: it should be still needed, we're still using db40 for mysql parser cache [13:22:08] as for if it's anywhere, no idea. Tim/Asher would know more [13:22:46] enwiki 0 | Creating tmp table | (SELECT /* SpecialRecentchangeslinked::doMainQuery XXXX */ `recentchanges`.*,ts_tags,fp_sta | [13:22:52] Reedy: yea, i see in a ticket that Tim once worked on it and ran it, but then it was open with the question "does it still need to be added to cron" [13:23:01] can't we shift load out of db12 ? [13:23:57] PROBLEM - Host mw1105 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:07] hashar: db12 is used for watchlist/rc etc [13:24:09] Reedy: i am guessing "on hume", but i wouldn't know which parameters and how often, gonna ask Asher. thx [13:24:11] oh db12 already down to 0 [13:24:43] maybe we have made a change in mw that introduce to much stress ? [13:24:52] I have seen a lot of "copying to tmp table" [13:26:02] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [13:26:05] hashar: i dunno but the "DELETE /* LinksUpdate::incrTableUpdate Traveler100.." is gone now [13:27:12] 1173 | Sending data | SELECT /*!40001 SQL_NO_CACHE */ * FROM `templatelinks` [13:27:23] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [13:28:07] on enwiki [13:29:33] well that select is just about 327Million rows [13:30:07] must be form mysqldump indeed [13:30:46] hi, need somebody from India or Indian language knowledge, especially "Assamese" to confirm if a bug is resolved. [13:31:16] i think i should try that on #wikipedia, heh [13:31:52] mutante: I know of santosh and yuvipanda (not sure they know Assasmese though) both offline [13:31:54] That sounds rather very much like vandalism lol [13:34:14] Damianz: heh, it was related to BZ 33507, but nobody reopened that.. so looks resolved [13:34:48] New patchset: Demon; "Various tweaks to gerrit.pp to get it running in labs:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [13:35:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [13:36:26] !log db12 suffering some 1400sec (and growing) replag. mysqldump in progress on that host. [13:36:36] Logged the message, Master [13:38:27] hashar: which servers would need "librsvg" to render svg? [13:38:42] image scalers at least [13:39:18] SVG rendering should be done through the thumbnailing infrastructure [13:39:59] alright, i see "librsvg2-bin" in imagescaler.pp [13:40:01] mutante: imagescaler::packages . Why do you ask? [13:40:13] 'db32' => 0, # snapshot host [13:40:14] because you asked about the version of it in some RT a few months ago [13:40:18] and it is still open [13:40:26] Reedy: local hack ? 
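The ishmael links above are one view of what db12 is busy with; the same question can be asked of the server directly. A sketch, assuming MySQL access to the host and with the 60-second threshold picked arbitrarily:

    mysql -h db12 -e "
        SELECT id, user, time, state, LEFT(info, 80) AS query
        FROM information_schema.processlist
        WHERE command <> 'Sleep' AND time > 60
        ORDER BY time DESC LIMIT 15;"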
[13:40:36] lol, it's a comment in db.php [13:40:41] apergos would know [13:40:46] hashar: rsvg --version ;) [13:40:48] Reedy: the dump comes from snapshot4 yes [13:41:11] "The version in use is 2.26.3 as a WMF package" [13:41:15] *** 2.26.3-0wm1 0 [13:41:18] ..that was a while ago [13:41:34] lucid chips 2.26.3 too [13:41:36] "Our bug tracker has several bugs which might be fixed by upgrading that tool" [13:41:38] i am not sure what is in -wm1 [13:41:43] hmmm [13:41:58] https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/librsvg/debian/ [13:41:59] i hope wm1 are the fixes :) [13:42:11] I am pretty sure not :) [13:42:12] I am sorry to say it has been a long time since I looked at any of that [13:42:18] https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/librsvg/debian/changelog?view=markup [13:42:19] so what version and where... no idea [13:42:25] oh there are patches :-) [13:42:45] apergos: I wasn't actually pinging about the imagescalers. Just the host being used for the enwiki dumps [13:43:13] oh [13:43:16] mutante: the patch that matter is wikimedia-brand.patch [13:43:20] db32 is commented as the "snapshot host", but seems db12 is being used, aswell as for watchlists, rc etc [13:43:23] what about it, sorry? [13:43:39] 'dump' => array( [13:43:39] 'db12' => 1, [13:43:46] yeah it is dumping tables. these dumps hold no locks, [13:43:46] lying comment lies [13:44:08] which may not be awesome for the users of the data but [13:44:19] it means they should have minimal impact on the dbs [13:44:24] mmm [13:44:37] shame we can't use the idle eqiad slaves [13:44:49] well at some point maybe we will [13:44:55] (have an idle slave that gets used for them) [13:44:59] but in the meantime, ... [13:45:11] yeah, lots of cross dc traffic isn't good [13:45:12] New patchset: Demon; "Various tweaks to gerrit.pp to get it running in labs:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [13:45:20] well db12 as a nice "select * from `templatelinks`" query sending data 327M rows [13:45:36] yep [13:45:43] because it's dumping the whole table [13:45:48] New patchset: Dzahn; "decommission db10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13881 [13:45:49] that's what it does [13:45:49] adding to db10 de-confusion [13:46:10] could it be making db12 lagging ? [13:46:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [13:46:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13881 [13:46:27] New review: Dzahn; " hm db10 should be decommissioned" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/13881 [13:46:50] mutante: for rsvg I think you want to talk about it with Tim. Might need to backport librsvg from Precise or something similar [13:46:58] it's slow but other things run and complete without it having been an issue, for oh a couple years now [13:47:09] (since we changed the lock options on the mysql dump invocation) [13:47:10] mutante: we would want to migrate the imagescaler to Precise anyway [13:47:40] hashar: yeah, thats what i thought, agree [13:48:10] mutante: you will want to talk about Precise upgrade during op meeting. 
Going to be a lot of fun :-) [13:48:54] mutante: like 6 new boxes to setup :) [13:51:49] apergos: well the lag started growing up at 6am UTC, looks like something is wrong :-/ [13:52:02] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=MySQL+pmtpa&h=db12.pmtpa.wmnet&v=1517&m=mysql_slave_lag&jr=&js=&vl=secs&ti=mysql_slave_lag [13:52:11] hashar: heh,sticking to RT for now, but creating a new one to do precise upgrade [13:52:20] Like I say, it happened about 13 hours ago [13:52:33] mutante: you will want to talk about doing precise upgrades during ops meeting [13:53:47] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [13:55:08] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [13:59:24] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:00:08] hashar: aha, i found more on rsvg elsewhere "the oneric version of rsvg needs to be backported to lucid. " per hexmode. linking [14:00:18] !! [14:00:18] an awesome bash trick [14:00:29] !del !! [14:00:33] finally [14:00:35] @del !! [14:00:46] !? [14:00:47] https://bugs.launchpad.net/ubuntu/+source/librsvg/+bug/921897 [14:00:51] hashar: are you doing the rsvg thing? [14:00:54] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [14:01:01] hexmode: no just talking about it [14:01:16] I am pretty sure it will only be upgraded when the imagescalers move to Ubuntu Precise [14:01:52] ^ [14:01:53] indeed [14:01:57] hexmode, hashar: RT-2548 and RT-2585 are the tickets of each other :p [14:02:15] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [14:02:19] and yes, dependency for precise upgrade [14:03:36] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [14:16:26] New review: Hashar; "some files only gain 1 bytes such as the gnu-fdl.png . They also have a text comment which could be ..." 
[operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13285 [14:26:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13316 [14:28:57] New patchset: Hashar; "detect cluster with /etc/wikimedia-realm" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13888 [14:34:15] !log Shutdown BGP session to 2828 on cr1-sdtpa [14:34:25] Logged the message, Master [14:37:54] !log Shutdown PyBal BGP sessions on cr1-sdtpa [14:38:03] Logged the message, Master [14:39:54] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:41:48] !log Rebooting cr1-sdtpa [14:41:57] Logged the message, Master [14:42:45] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:43:48] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [14:43:48] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [14:43:48] PROBLEM - Host cr1-sdtpa is DOWN: CRITICAL - Network Unreachable (208.80.152.196) [14:43:48] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [14:43:57] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [14:43:57] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [14:44:06] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [14:44:06] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [14:44:06] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [14:44:06] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [14:44:06] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [14:44:07] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [14:44:15] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [14:44:33] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [14:44:33] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [14:44:42] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [14:44:42] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [14:44:51] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [14:45:55] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:46:21] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.68 ms [14:46:21] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [14:46:21] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [14:46:21] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [14:46:21] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.32 ms [14:46:21] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms [14:46:22] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms [14:46:23] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.30 ms [14:46:23] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.39 ms [14:46:24] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 5.14 ms [14:46:30] RECOVERY - Host ps1-b5-sdtpa is UP: PING 
OK - Packet loss = 0%, RTA = 2.67 ms [14:46:39] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [14:46:48] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [14:46:57] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms [14:47:15] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:47:33] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.42 ms [14:47:33] RECOVERY - Host cr1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [14:48:45] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:49:39] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.84 ms [14:50:06] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:50:25] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [14:50:42] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:51:45] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 64%, RTA = 0.45 ms [14:53:33] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 3, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [14:53:42] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 73%, RTA = 0.24 ms [14:54:27] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [14:55:12] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:55:21] !log Reboot of cr1-sdtpa did not fix the RE packet loss issue... therefore unlikely to be leap second related [14:55:31] Logged the message, Master [14:56:33] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [14:56:42] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [14:58:54] New review: Dzahn; "abandon? close RT-3078?" [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9874 [15:01:12] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:01:12] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:04:16] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [15:09:40] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 4, down: 2, shutdown: 0BRPeering with AS14907 not established - The + flag cannot be used with the sub-query features described below.BRPeering with AS64600 not established - BR [15:10:07] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [15:10:35] New review: Jeremyb; "I have no idea what RT 3078." 
[operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [15:12:40] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:14:10] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [15:17:19] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:18:58] PROBLEM - Host mw1096 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:24] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13888 [15:21:40] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:25:10] New review: Hashar; "That change broke enwiki, and possibly over wikis, by making the resourloader drop the 'href' elemen..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13322 [15:26:28] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:29:22] since the dumps just started on the next table, I interrupted them [15:29:27] (for en wiki) [15:30:06] I still don't see how the dumps by themselves can be to blame, since they are running the same stuff they have been for months and months, but I don't know what to look at over there [15:30:12] on db12 [15:32:14] * jeremyb waves mutante ;) [15:32:47] at this point the replag is 2087. let's see what it is in ten minutes [15:33:01] anyone want to test bug 31369 with me? ;) redirects.conf [15:34:09] jeremyb: sure [15:35:02] Jeff_Green: so, either deploy everywhere and i'll test or deploy to a specific backend and find a way for me to get there? i only have access to labs [15:35:19] Jeff_Green: or i give you a bash script with curls to run [15:35:20] there's a staging box we can use [15:35:24] sec [15:35:46] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [15:35:56] srv193 [15:36:26] jeremyb: jobs.wm better for you? ack? [15:36:35] mutante: i mailed [15:37:07] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [15:37:17] i see. thx [15:37:40] mutante: and updated the bug with a link to 2402 [15:38:10] great. did the same in RT [15:38:55] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:40:11] db12 replag is now 2133 [15:40:15] jeremyb: yeah, you can do that in gerrit, but just note that currently it needs git and also still svn, unfortunately [15:40:16] no dump process. [15:40:25] I think we can rule that out as a cause. [15:40:39] mutante: err, huh? [15:40:58] jeremyb: if somebody merges in gerrit it will not be deployed to cluster automagically yet [15:40:59] mutante: what's svn? [15:41:00] Reedy: hashar: ^^ [15:41:22] mutante: well sure. same with puppet and mediawiki and initialisesettings and...? [15:42:07] apergos: just over half an hour? bleh [15:42:11] jeremyb: hmm, not quite the same. apache config is still in subversion [15:42:17] have we got some stupid rc/watchlist queries running [15:42:18] * Reedy looks [15:42:40] mutante: errr, ugh?! 
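For the redirects.conf test being arranged earlier in this exchange, one low-risk approach is to point curl at the staging box via the Host header rather than touching DNS; a sketch, with the backend FQDN assumed for illustration (the log only names it srv193):

    curl -sI -H 'Host: jobs.wikimedia.org' http://srv193.pmtpa.wmnet/ \
        | egrep -i '^(HTTP|Location)'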
[15:42:49] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:42:50] well so the point is this has been going on much of today [15:42:57] mutante: Jeff_Green: back in a bit [15:43:03] k [15:43:09] the only thing that's been steady today (I think) is the populate sha1 stuff [15:43:15] That's a lot of slave waiting [15:43:25] yes it is [15:43:44] if it had actually been the dumps process I would have expected the numbers to drop slowly once I shot it [15:43:45] but nope [15:43:56] and apparently that's all the machine is doing [15:44:07] (2151 now) [15:44:27] uh huh [15:44:57] with the odd contribs query [15:45:34] maybe it's worth another kill -CONT on populateRevisionSha1.php for a little while? [15:45:40] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:52] er [15:46:04] -STOP , sorry [15:46:07] Might be worth it [15:46:12] New patchset: Hashar; "DO NOT SUBMIT (bug 37245) makes labs use bits.beta.wmflabs.org DO NOT SUBMIT" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13892 [15:46:17] Not going to make anything worse [15:46:34] RECOVERY - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [15:47:22] 8 of em at once, meh [15:48:14] New review: Hashar; "Do not even think about submitting this change cause it will break the wikis horribly sending back t..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/13892 [15:48:49] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:50:10] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:50:46] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [15:51:25] apergos: back sorry [15:51:49] Reedy: I have seen a few queries doing "copying to temp table". [15:51:59] Indeed [15:52:04] Reedy: though not that many and they were not lasting a long time [15:52:08] Watchlist/RecentChanges etc aren't nice [15:52:09] so I thought it was harmless [15:52:51] I figure those are always happening [15:53:22] so the q is what's actually pushing it past the breaking point [15:53:36] (I haven't stopped those populates yet) [15:54:09] indeed [15:54:23] it's only 4 of them but there's the parent shell too [15:54:23] I can't believe adding some sha1 hashes is taking the machine down [15:54:33] yeah but [15:54:37] there must be more traffic from new revisions [15:54:38] http://wikitech.wikimedia.org/view/Server_admin_log [15:54:53] see here... for db32 [15:55:18] yeah [15:55:30] well it was 2153 and now it's 2127 [15:57:05] Hopefully asher should be about in an hour or so [15:57:13] could something have just completed? but I don't know what it would be [15:59:49] @replag db12 [15:59:50] djhartman: [db12: s1] db12: 2092s [15:59:55] yup, going down [16:00:09] Has anyone looked at the logs on the machine? Just out of interest [16:00:19] me, and I saw nothing helpful [16:00:36] right, just making sure :) [16:00:42] yeah [16:01:47] I sure hope asher can solve it, I need to be able to run my dumps (and sadly I cannot direct them to run on a specific host, as a general rule, they follow what the db load balancer returns) [16:03:26] are the snapshot host comments in db.php a lie then?
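The kill -STOP / -CONT idea floated above can be applied to all of the populateRevisionSha1.php runs at once; a minimal sketch, to be run on the host carrying the screen sessions (hume, per the earlier discussion):

    # pause every instance; they keep their state and the screen sessions stay up
    pgrep -f populateRevisionSha1.php | xargs -r kill -STOP
    # later, once the slave has caught up, let them continue
    pgrep -f populateRevisionSha1.php | xargs -r kill -CONT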
[16:03:47] /legacy [16:04:55] I don't even know about those [16:05:21] I suspect it's a yes then [16:05:22] heh [16:05:32] those mean something else [16:05:45] prolly llvm or something [16:05:55] nothing to do with dumps [16:05:59] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:06:44] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [16:08:37] @replag db12 [16:08:37] djhartman: [db12: s1] db12: 2005s [16:08:55] ok... going down quite a bit. [16:09:44] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 5, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [16:15:53] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:22:33] Jeff_Green: back for at least 30 mins [16:22:45] ok [16:23:00] so one observation: your merge is against a now-outdated redirects.conf [16:23:09] i can rebase [16:23:16] does it not merge cleanly? haven't looked [16:23:21] lemme make sure the live stuff is actually checked in [16:23:23] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [16:23:30] hah [16:23:40] it's in two revision control systems at the moment [16:25:15] ok, git is in sync with svn again . . . [16:26:08] Jeff_Green: did you do a push to gerrit? i'm not seeing it [16:26:29] honestly, I don't know [16:26:33] heh [16:26:54] i don't really understand how git+gerrit are set up in terms of the http conf [16:27:19] oh wait [16:27:20] well if you updated git i don't see what you did [16:27:21] what do you need to know? :) [16:27:42] <^demon|lunch> Jeff_Green: gerrit is available at localhost:8080, apache proxies it out over gerrit.wm.o:80/r/ [16:27:44] <^demon|lunch> :) [16:27:53] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [16:27:56] <^demon|lunch> :443, even [16:28:12] so, after I commit on fenari, do I push for review? [16:28:12] ^demon|lunch: that is so not what he means [16:28:14] or do I push? [16:28:29] how is this supposed to work? [16:28:35] <^demon|lunch> jeremyb: Oh, whoops. [16:28:37] <^demon|lunch> Ignore me. [16:28:40] <^demon|lunch> Going back to my lunch [16:28:45] ^demon|lunch: enjoy! [16:29:40] Jeff_Green: so the problem with push is that no notifications are ever generated [16:29:54] Jeff_Green: you can +2 yourself and then merge immediately [16:29:56] Jeff_Green: i think i know [16:30:19] Jeff_Green: you commit to gerrit for like documentation and review and then you go to fenari as you did before [16:30:32] Jeff_Green: and also commit to svn and use the old sync scripts to push it [16:30:53] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [16:31:21] <^demon|lunch> jeremyb: So yeah, that needs fixing. Badly. [16:32:11] ^demon|lunch: also, we need disposable branches where people can push things so they don't need to use github or gitorious for work on gerrit branches [16:32:33] ^demon|lunch: i can think of a few more things too :P [16:32:33] <^demon|lunch> Totally doable. 
Found a way to do that with permissions last week :) [16:32:34] ^demon|lunch: actually /h/w/conf/httpd is both now, "svn status" and "git status" return stuff :p [16:32:41] ^demon|lunch: whoa [16:32:52] Jeff_Green: you don't commit on fenari [16:33:09] Jeff_Green: you push into gerrit from your local repo, then pull from fenari [16:33:09] you dont? how do you use sync-apache then [16:33:25] <^demon|lunch> jeremyb: https://gerrit.wikimedia.org/r/Documentation/access-control.html#_project_access_control_lists -- see the paragraph starting with "References can have the current user..." [16:33:30] *for all public repos [16:33:38] for the private repos, you just commit [16:33:41] nothing else needed [16:33:43] mutante: Jeff_Green: i thought there was a repo with some stuff gitignored and other stuff svnignored so stuff doesn't go in both repos. maybe this is that repo? [16:34:01] you know, I have no idea anymore [16:34:22] last I knew, it was: make your change locally at fenari:/home/wikipedia/conf/httpd and svn ci stuff [16:34:34] then suddenly it was "omg there's git layered on top of that" [16:34:43] with zero documentation [16:34:46] so . . . [16:35:00] I've done the first step, everything is checked in to svn [16:35:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:35:18] last i heard was that i need to commit to gerrit and also do it the old way [16:35:24] and I did a "git commit -a" to make sure I'd catch everything local that is not in git yet [16:35:31] and that is where I stopped [16:35:34] Jeff_Green: there's always been docs [16:35:53] well, i did it in svn and then hashar updated gerrit with one change that included like the last 3 svn commits to sync [16:35:58] Jeff_Green: http://wikitech.wikimedia.org/view/How_to_deploy_code [16:36:34] it's been there since Nov 2010 ;) [16:36:36] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [16:37:09] Ryan_Lane: I don't really see how taht doc covers this situation [16:37:20] what are you attempting? [16:37:23] apache change? [16:37:32] yep [16:37:35] ag [16:37:36] ah [16:37:37] redirects.conf etc [16:37:41] yeah, that's always been undocumented [16:37:46] :( [16:37:51] so [16:37:55] it's the same as before [16:38:00] make a local git commit [16:38:02] push it out [16:38:10] unless that's been moved to gerrit? [16:38:17] I have no fucking clue [16:38:20] * Ryan_Lane sighs [16:38:20] so how does that get back into gerrit so jeremyb can haxor it [16:38:21] I think hashar created the repo [16:38:21] it is in an in-between state [16:38:26] i talked with hashar [16:38:26] it's just not been properly migrated [16:38:33] in-between states and production make me want to stab [16:38:33] oh right [16:38:36] there is no new deployment script [16:38:40] yet there is a gerrit repo [16:38:42] he was worried about sensitive stuff [16:38:45] and svn is not removed [16:38:51] oh christ [16:39:03] so hashar wants to keep gerrit repo in sync [16:39:05] people can't leave things in inbetween states [16:39:12] but you still need to use svn to actually push stuff to cluster [16:39:14] * andrewbogott may be 5 or so minutes late to ops meeting. 
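Pulling together the workflow described above, the interim apache-config procedure amounts to committing in both systems and still deploying via the old path; a sketch of that sequence (paths and the sync script name taken from the log, the exact gerrit push syntax assumed):

    cd /home/wikipedia/conf/httpd
    svn status && git status             # both VCSes track this tree at the moment
    svn ci -m 'update redirects.conf'    # the copy the old deploy path actually uses
    sync-apache                          # push out to the apaches, as before
    git commit -a -m 'update redirects.conf'
    git push origin HEAD:refs/for/master # keep the gerrit copy in sync / reviewable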
[16:39:17] if you're going to do something, do it all the way [16:39:24] That's what she said [16:39:31] heh [16:39:46] Jeff_Green: well, that cusk [16:39:48] *sucks [16:40:01] that's almost cask [16:40:09] so if you ignore git and just do svn, you will be able to make actual changes, and then it will make hashar commit it in gerrit as well [16:40:31] lovely... [16:40:49] I can see leaving something in this state for a day or so [16:40:55] but I remember seeing that email last week [16:41:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:41:15] i think it's been over a week [16:41:21] * Ryan_Lane sighs [16:42:18] <^demon|lunch> I want to see us ditch the last of SVN too, but leaving things in half-migrated states puts in a bad place. [16:42:25] <^demon|lunch> Where we waste almost half an hour figuring out wtf to do. [16:42:35] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [16:45:17] enwiki is down [16:45:44] it was temporarily [16:45:46] it seems [16:45:52] yeah, for a few sec. [16:45:58] Jasper_Deng: back? [16:46:04] yeah [16:46:06] where's nagios at? [16:46:28] also, i asked a few days ago: how do we get watchmouse notifs in IRC? [16:46:55] where is the watchmouse config (that says what to monitor)? in version control? [16:46:57] write a bot that parses the notification emails [16:47:51] the watchmouse config is changed in a web ui [16:48:16] ugh [16:48:38] http://status.wikimedia.org/feed/8777 seems unuseful [16:48:53] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:49:04] if they had a decent feed with pubsubhubbub that would be great [16:49:22] anyway, how do you subscribe to the emails? [16:52:05] jeremyb: i guess we could subscribe a mailing list [16:52:25] and then use mailman for the subscribers [16:52:34] ok... do they send recovery mail or just broken mail? [16:52:53] depends on each single check config afair [16:53:05] maybe you rather want Nagios? [16:53:15] ..as RSS? [16:53:33] no, watchmouse sometimes alerts and nagios doesn't. i want to know about those too [16:53:47] in particular i want IRC to know about them [16:54:22] (this current enwiki case is one example and another is gerrit ~5-10 days ago) [16:54:29] i guess easiest is to add one new user to watchmouse, forward to a list, subscribe to list, parse mail by bot [16:54:42] right [16:54:53] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:55:10] i was just wondering if it also sends mail for recoveries or just it's broken [16:55:12] Unless it's alerting about the server mailman is on. [16:55:18] hah [16:55:58] well, nagios will notice that though, pretty sure :) [16:56:01] You could parse the status page every x seconds but pubsub would be far better or even http streaming. [16:56:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:56:25] nah, i prefer pubsub i think [16:56:59] * jeremyb wonders who knows if we can ignore all of these BGP alerts. no leslie [16:57:04] mark? [16:57:16] he is working on them [16:57:25] ahh [16:57:27] Mark has been doing network stuff all day for solve some issue [16:57:43] notpeter: you there? 
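A hypothetical sketch of the watchmouse-to-IRC relay being brainstormed above (extra contact address -> mailing list -> bot parses the mail): the MTA pipes each notification into a small script like the one below. irc-announce and the channel argument are stand-ins for whatever would actually post to IRC; they do not exist today.

    #!/bin/sh
    # one watchmouse notification arrives on stdin (e.g. via .forward or a procmail rule)
    subject=$(grep -i -m1 '^Subject:' | sed 's/^[^:]*:[[:space:]]*//')
    printf 'watchmouse: %s\n' "$subject" | irc-announce '#channel'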
[16:58:28] notpeter: can you please merge and push https://gerrit.wikimedia.org/r/13893 [16:59:10] jeremyb: watchmouse has been renamed to nimsoft cloud user experience monitor [16:59:14] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [16:59:26] jeremyb: "Nimsoft Cloud User Experience Monitor supports all types of APIs, including REST, SOAP, oAuth, JSON, XML, RSS feeds, openID and XML-RPC." [16:59:28] mutante: what a mouthful [17:00:35] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:03:17] mutante: funny URL! http://www.nimsoft.com/solutions/nimsoft-cloud-user-experience.html/.html [17:03:38] heh, yes [17:05:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [17:06:22] preilly: i think he's afk. any time? [17:06:34] mutante: I need it now [17:07:22] New review: Dzahn; "inetnum: 41.203.128.0 - 41.203.159.255" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13893 [17:07:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13893 [17:08:07] preilly: done. looks like Orange Niger indeed [17:09:00] PROBLEM - Host mw1076 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:16] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:20:52] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:23:17] meta is down atm [17:23:25] just came back up [17:27:55] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [17:29:43] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [17:30:05] Jasper_Deng is our watchmouse for now i guess [17:30:16] am I? [17:30:25] Well you can't count on me 100%, ofc [17:30:51] * jeremyb wants an SLA [17:32:16] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:34:10] SLA? We get to pump Jasper_Deng full of caffine and hold his eyes open with duct tape? COOL [17:34:26] * Jasper_Deng does not use caffeine [17:38:52] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:40:32] is it true that edits to templates can have serious performance impacts? say a template used in 10 million pages. [17:40:49] down [17:40:53] (enwiki) [17:41:07] back up again [17:41:43] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [17:44:52] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [17:45:55] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [17:46:13] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 37%, RTA = 0.20 ms [17:47:20] ToAruShiroiNeko: sure? is there some followup question? don't wait to ask... [17:49:28] ToAruShiroiNeko: edits to templates are more expensive than edits to non-templates. edits to popular templates are more expensive than edits to lonely templates. expense can be stretched out over time but it's still expense. idk what's currently working or not. (see e.g. 
Dispenser's bot and Tim's push on the job queue) [17:50:48] ToAruShiroiNeko: https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#3_million_null_edits [17:56:04] Jeff_Green: so, let me know when you figure out the procedure? or are we just waiting for hashar? [17:56:47] i think it's up to hashar at this point [17:57:18] in fact, I'm not even sure how I would pull from git to production for configs [17:57:46] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:58:42] Jeff_Green: i hear wikitech calling... ;/ [17:59:27] afaik there's a giant glaring divot in wikitech on this exact subject [17:59:46] no i mean asking for someone to make a page ;) [18:00:09] yup. [18:07:31] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:12:01] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:14:07] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:15:33] it's a hashar ;) [18:16:22] hi [18:16:22] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:16:26] * Damianz tickles hashar with a feather [18:17:10] New patchset: Pyoungmeister; "copying roles from site.pp to role/apache.pp plus cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13898 [18:17:43] New patchset: Pyoungmeister; "removing myself from nagios until I return from vacation on 2012-07-09" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13899 [18:17:45] hashar: so, it seems most people don't understand how to deploy apache conf changes in this git+svn world. or, well, just about anything else about the setup [18:18:03] I did sent an email to the internal operations list [18:18:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13898 [18:18:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13899 [18:18:34] jeremyb: willing to get names :-] [18:18:36] hashar: i think people still don't get it. but i haven't seen the email of course [18:19:09] hashar: Jeff_Green was trying to test bug 31369. other people in this channel had no answers for him [18:19:48] jeremyb: I have copy pasted the email to your mailbox :) [18:19:55] * jeremyb reads [18:20:02] hashar: the gist is that we have changes that are in svn and local git but not pushed back to gerrrit, and we're not sure how to do that sanely [18:20:17] 02 16:33:43 < jeremyb> mutante: Jeff_Green: i thought there was a repo with some stuff gitignored and other stuff svnignored so stuff doesn't go in both repos. maybe this is that repo? [18:20:27] Jeff_Green: there is no way!! it is insane right now :-))) [18:20:38] Jeff_Green: have you read my email to the operations list? [18:20:44] hashar: that should be :-(((( [18:21:03] whoa, RCS?! [18:21:11] what century is this? 
[18:21:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13551 [18:21:58] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13899 [18:21:59] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13898 [18:22:28] ok, done reading [18:22:38] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:22:44] nothing really new there i think? [18:23:01] jeremyb: we used RCS a long time ago. I am not sure cvs let you create a local filerepo and rcs is essentially the same anyway (no svn did not exist yet :p ) [18:23:03] hashar: so, should you be able to just `git pull` on fenari? [18:23:19] hashar: i've used cvs locally [18:23:21] Jeff_Green: so yeah the files are tracked in both a local svn repo and a remote/public repo [18:23:27] Jeff_Green: so you basically have to commit twie [18:23:47] Jeff_Green: if there is really nothing sensible in the svn repo, we could migrate everything to the public git one [18:23:58] sensitive* [18:23:58] so do I essentially do push_for_review_production ? [18:24:01] Jeff_Green: and/or we could set up a second private git repository [18:24:08] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:24:12] Jeff_Green: master not production [18:24:22] so basically, do your change on fenari and svn commit them as usual [18:24:28] <^demon> How about one repo that makes sense rather than 2 git repos and an svn repo that doesn't make sense? [18:24:29] then deploy the change with apache-sync or something [18:24:35] yup that part is done [18:24:40] then I did "git commit -a" [18:24:42] once done you can git add the changes you made and send them for review [18:24:43] and stopped after that [18:24:45] then merge in gerrit [18:24:52] let me check [18:25:24] so currently we have some setting for enwiki => true [18:25:32] in InitialiseSettings.php [18:25:42] grr [18:25:44] wrong dir [18:25:46] * hashar hides [18:25:52] ha [18:26:22] * jeremyb gets out the xray gun [18:27:17] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [18:27:17] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [18:28:37] grrr [18:28:38] I hate svn [18:30:08] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:30:26] !log set up ignore file in httpd configuration directory [18:30:35] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:30:36] Logged the message, Master [18:31:02] Jeff_Green: so indeed we have several change in the git master branch [18:31:15] Jeff_Green: svn status is empty so everything is probably committed to svn already [18:31:22] yes [18:31:34] git log --oneline --decorate --graph [18:31:38] many of those changes are mine [18:31:40] that will show you the git commits in there [18:31:56] (origin/master) is the remote branch. 
Currently pointing to d2bdf3d [18:32:18] 4 commits later (de2c12c) are the pointers master (local branch) and HEAD (the current working copy) [18:32:29] the first thing is to update the remote (aka origin) [18:32:37] New patchset: Jalexander; "US only for shop link" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13900 [18:32:41] cause some change might have been merged in origin [18:33:08] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:34:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13900 [18:34:13] hashar: so can we do a little merge party to make origin/remote/gerrit resemble what's current in svn? [18:35:19] yes sure [18:35:27] the first thing is to update the remote [18:35:38] git fetch origin [18:35:38] or git remote update [18:35:46] that will fetch the latest pointer value for origin/master [18:35:52] I let you do it :) [18:37:05] fetched [18:37:38] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:37:51] so there's plenty o diff of course [18:38:05] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:38:38] Jeff_Green: http://dpaste.org/0Y6W9/ [18:38:39] indeed [18:38:52] --graph is lovely there [18:39:01] it instantly shows you that the branches have diverged [18:39:11] so we have to rebase the local master on top of origin/master [18:39:30] that is where my git skills are toping of, I need to read the man page for git-rebase to find out how to use git rebase --onto [18:39:49] ok [18:40:09] of course that the worse man page ever :/ [18:40:20] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [18:40:27] lets try [18:40:40] $ git rebase master origin/master [18:40:41] First, rewinding head to replay your work on top of it... [18:40:42] Applying: redirects.conf: mk whitespace consistent [18:40:43] Applying: comment out redirect cfp.wikimania.wm [18:41:05] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:41:10] bah epic failure [18:41:40] those are my commits you're quoting ;) [18:41:58] hashar: do you need some help with the surgery? [18:42:08] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:42:26] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:42:36] sure [18:42:51] I screwed the local repo :) [18:43:32] so hmm [18:43:35] I did the hard way: [18:43:51] git reset origin/master && git cherry-pick (the commits from the master branch) [18:43:53] that is ugly [18:44:27] Jeff_Green: I have did some ugly work [18:44:46] jeremyb: given the log graph at http://dpaste.org/0Y6W9/ how would one put the 4 changes from master onto of origin/master ? [18:45:11] I know we can do it but every time I want to do that I end up spending half an hour testing it in a demo repo [18:45:20] and eventually give up by just doing cherry-picking [18:46:08] ha [18:46:11] Jeff_Green: I cleaned up the repo. I have put master (local) to follow remotes/origin/master [18:46:21] git log --oneline --graph --decorate gives a nicer output now [18:46:32] hashar: i don't see which are in origin and which aren't. 
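For the graph in the dpaste above (local master a few commits ahead of a diverged origin/master), the usual form is "git rebase <upstream> [<branch>]": the first argument is what you rebase onto, so "git rebase master origin/master" works in the opposite direction from what was wanted and leaves a detached HEAD. A minimal sketch, assuming a clean working copy:

    git fetch origin                         # refresh remotes/origin/master
    git log --oneline --graph --decorate     # confirm where the branches diverged
    git rebase origin/master master          # replay master's local-only commits onto origin/master
    # equivalent, with master already checked out: git rebase origin/master
    git push origin master:refs/for/master   # then send the rebased commits to gerrit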
i guess i could ask gerrit [18:46:40] hashar: going forward how should we make changes? [18:46:50] jeremyb: in http://dpaste.org/0Y6W9/ the first 4 are not in gerrit I think [18:47:02] Jeff_Green: now that the repo is clean locally, we can push the 4 commits to gerrit :) [18:47:22] Jeff_Green: as to how to make change, please ask in ops list [18:47:30] hashar: nobody knows [18:47:31] Jeff_Green: I know RobH was wondering the same [18:47:38] everybody is wondering the same [18:47:46] so you have to discuss about it I guess [18:47:51] couldn't you just rebase on top of the current origin/master ? [18:47:55] well my suggestion is to rip out git for now [18:47:56] what was i wondering? [18:48:10] and leave just svn, and not be in a half-way state where people don't know where to initially submit changes [18:48:10] RobH: talking about httpd conf and git [18:48:16] Jeff_Green: the idea is to drop svn :) [18:48:50] New patchset: Hashar; "modified redirects.conf for educacao.wikimedia.org per RT #3138" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13904 [18:48:52] New patchset: Hashar; "redirecting wikimediafoundation.org/wiki/Donate -> donate.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13906 [18:48:53] New patchset: Hashar; "re-syncing git to svn, :-(" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13907 [18:49:02] here are the changes ( git push origin master:refs/for/master ) [18:49:10] who was running mysqldump on db12? [18:49:28] binasher: hello asher we talked about this morning. snapshot4 was doing a mysqldump for the dumps [18:49:47] hashar: can we do that yet? [18:49:56] binasher: I though about migrating the special queries to another idling db from s1, but did not want to cause any havoc [18:50:05] hashar: thanks.. the xml data dumps? [18:50:09] Jeff_Green: i think there's still a question about sanitizing? [18:50:30] Jeff_Green: because some files were already public and some weren't. why weren't they? should they remain secret? [18:50:30] Jeff_Green: can you merge the changes in gerrit one after the other ? 13904 / 13905 / 13906 / 13907 ? [18:50:41] hashar: yep [18:50:46] binasher: I think so. We did some log in serveradmin logs [18:51:08] hashar: why didn't the bot mention 13905? [18:51:17] binasher: [18:51:28] the dump process runs a series of mysqldumps [18:51:33] we do this every month (enwiki) [18:51:42] like clockwork, no table locking [18:51:53] binasher: the problem we had is that db12 received queries from 3 sources: 1) populateSha1something (Tim kill -STOP them sunday), run by Aaron on hume 2) the mysqldump 3) the usual whatclist / specials queries [18:52:35] binasher: mysqldump is not going to do any harm since it is set to avoid locking the tables [18:52:37] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13904 [18:52:54] oh hi apergos :) I tried to wrote a summary of today for asher. Feel free to correct [18:53:05] jeremyb: some packet loss maybe? [18:53:11] however until we know definitively what went wrong I don't want to start en wiki dumps again and leave them unattended [18:53:12] hashar: locking tables isn't the only issue mysqldump causes [18:53:18] jeremyb: I don't know really [18:53:37] hashar: errrrm? no UDP here i think [18:53:37] so we ended up wondering what was happening [18:54:02] we did investigate the whatchlist queries, some of them doing 'writing to tmp (table or file? 
can't remember)' [18:54:06] hashar: running it in a single transaction without locking tables blocks purging for the duration which kills writes once the iblog fills [18:54:29] jeremyb: yeah that is a file, would need to look at the log file. Not sure where it is though and I probably don't have access on gerrit server. [18:55:01] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13905 [18:55:17] hashar: sure. but i'm just saying it can't be packetloss [18:55:40] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13906 [18:55:46] jeremyb: maybe the bot forgot about it ? [18:56:02] Jeff_Green: so that is a bit tedious :-] [18:56:04] hashar: i think not [18:56:20] hashar: scriptable! [18:56:23] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 28 seconds [18:56:29] no access on server so no idea what went wrong [18:56:42] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13907 [18:56:47] hashar: i mean the tedium is scriptable [18:56:50] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 25 seconds [18:57:08] hashar: yes, I think it's not a reasonable state to have two revision control systems in parallel [18:57:24] binasher: you will want to discuss about db12 / dump with ariel I guess. [18:57:27] especially if one can't take the lead [18:57:39] binasher: I am not even sure I know how innodb works :-D [18:58:01] Jeff_Green: well, what's in those other files? noc.conf, wikimedia-ssl.conf, others ? [18:58:08] Jeff_Green: that is kind of transition I guess. the svn repo probably has nothing private in it so we could just send the files ini git [18:58:48] Jeff_Green: they are probably find to go though [18:59:05] Jeff_Green: we just added to git the files that were already public, not knowing what to do with the other files [18:59:14] yep [18:59:43] where it gets confusing is when people expect to be able to work git-->gerrit-->prod when everyone else is working svn-->prod-->git [18:59:53] if there is really nothing private there, we could just add the files in git and forget about the svn repo [19:01:18] there's question about the propagation tools too, I don't know much about that [19:01:38] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:03:01] Jeff_Green: I have replied to Ryan reply to my ops mail and added you to bcc: [19:03:27] cool [19:03:33] Jeff_Green: as for propagating, I guess the idea would be to make the change on your comp, submit to gerrit, submit, git pull, deploy [19:03:37] aka like puppet more or less [19:03:51] yup, that would be fine imo [19:03:56] though you can do all of that on fenari directly since the remote repo is ssh://gerrit.wikimedia.org…. (aka no username) [19:04:37] once we're switched to git people shouldn't work directly on fenari, too dangerous [19:04:44] Jeff_Green: I am just not going to be the one making those files publicly available. 
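For reference on binasher's point a little further up: the non-locking dump the scripts run is presumably something along these lines (the exact options used on snapshot4 are an assumption), and its cost is purge rather than locks:

    mysqldump --single-transaction --quick --skip-lock-tables \
        -h db12 enwiki > enwiki.sql
    # --single-transaction holds one long consistent-read snapshot instead of
    # locking tables; while that snapshot is open InnoDB cannot purge old row
    # versions, so undo/history piles up for the whole duration of the dump --
    # the "blocks purging ... kills writes" effect described above.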
I have no idea how it would affect us to reveal them (I am probably just paranoid) [19:04:56] think of how often people throw up their arms with "my git checkout is totally broken" [19:05:20] hashar: yep [19:05:26] we could add a little consistency script [19:05:43] to make sure no unwanted change is in the working copy [19:05:56] and that (master) is 0 or 1 change ahead of (origin/master) [19:06:01] and git diff is empty :) [19:06:04] that might help [19:06:36] we could even make fenari use http instead [19:06:49] what I would love is a gitfs though. So we could have the apaches includes /mnt/git//main.conf [19:06:52] although that barfs frequently b/c gerrit is breaky [19:07:38] Jeff_Green: mediawiki has ssh:// so we can easily write live hacks directly on fenari and sync them without using gerrit [19:07:52] ah [19:07:54] that is used for emergencies [19:08:02] when you can't wait to push to gerrit + click a button + git pull [19:08:19] you just want to: git checkout HEAD^ && sync-file faulty.php [19:08:40] (aka checkout the version prior '^' to the working copy version 'HEAD' [19:09:28] Jeff_Green: daughter crying need to go [19:09:31] sorry :/ [19:09:38] follow up on operations list :) [19:13:10] This site is experiencing technical difficulties.
Try waiting a few minutes and reloading. (Cannot contact the database server: Unknown error (10.0.6.48)) You can try searching via Google in the meantime.
[19:13:10] Note that their indexes of our content may be out of date.
[19:14:02] s1 master [19:14:23] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:17:23] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:22:29] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [19:31:31] AaronSchulz: can you throttle the revision sha1 updates a bit? [19:32:27] binasher: I guess [19:32:40] binasher: lower rate or smaller chunks? [19:33:05] how is it currently chunked? [19:33:35] AaronSchulz: you need ALL of the wfWaitForSlaves() [19:35:15] re: chunking, the actual updates are one at a time, so lower rate would be needed or slave lag awareness as reedy says [19:35:31] There are the wfwaitforslave calls in a few places [19:35:56] * apergos lurks (sorry for being less intereactive, I erealized I had got to get dinner) [19:35:57] yeah [19:36:04] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=maintenance/populateRevisionSha1.php;h=1d8e4c8ba6393740c41f827d764efe969ca0fd15;hb=HEAD [19:36:17] batch is 200 by default [19:38:37] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:39:43] Reedy: it was running higher that 200 [19:39:52] I just halved it [19:39:54] ah [19:41:10] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [19:41:55] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [19:43:15] lag going up [19:43:43] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:44:28] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 191 seconds [19:44:36] boing [19:45:01] hmm [19:45:31] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 199 seconds [19:46:07] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:46:51] * AaronSchulz drops the batch size some more [19:49:01] and replag is going down [19:49:27] was [19:49:52] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 5 seconds [19:50:19] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 4 seconds [19:51:24] binasher: it's 300 rev chunks atm [19:52:07] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [19:52:22] * apergos grans [19:52:27] *groans [19:52:28] this could be a long night [19:52:52] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:53:37] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [19:54:51] yay replag 0 finally [19:55:10] I guess I will not restart dumps tonight and go to bed, I'll do it tomorrow morning [19:55:45] Change abandoned: Hashar; "Since I killed stylesheets on enwiki last week, I am going to override those settings in CommonSetti..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13892 [20:01:16] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:07:34] PROBLEM - SSH on pdf3 is CRITICAL: Server answer: [20:10:25] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [20:13:16] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:16:16] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [20:19:16] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:23:38] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 51, down: 0, dormant: 0, excluded: 0, unused: 0 [20:23:38] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [20:25:58] New patchset: Hashar; "(bug 37245) makes labs use bits.beta.wmflabs.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13932 [20:28:07] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 3, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [20:28:25] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:31:16] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [20:36:40] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Mon Jul 2 20:36:16 UTC 2012 [20:38:28] New patchset: Hashar; "labs: remove show (enabled in production)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13933 [20:38:59] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13933 [20:39:31] New review: Hashar; "safeguarded properly." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13932 [20:39:34] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13932 [20:41:03] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:42:24] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [20:54:33] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [20:58:27] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [21:04:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [21:10:00] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:14:03] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [21:14:30] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [21:17:03] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:17:39] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [21:23:30] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:25:27] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:28:00] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 5, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [21:28:54] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [21:29:30] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:29:57] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [21:30:33] PROBLEM - Puppet freshness on tarin is CRITICAL: Puppet has not run in the last 10 hours [21:33:33] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [21:34:09] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [21:34:43] New patchset: Bhartshorne; "moving around container rings to more heavily weight the SSDs (and thereby improve performance)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13938 [21:35:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13938 [21:36:06] PROBLEM - Host cr1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:37:54] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:38:03] RECOVERY - Host cr1-sdtpa is UP: PING WARNING - Packet loss = 44%, RTA = 59.03 ms [21:39:05] maplebed: are there several ssds up now? [21:39:13] AaronSchulz: no. [21:39:33] I'm just adjusting the weights so that it's more probable that every container has one copy on ms-b5. [21:39:49] AaronSchulz: turns out we won't get more SSDs till thursday or friday. 
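For context on the ring change merged above (change 13938): weighting the SSD-backed devices up so that more container partitions land on them is normally a swift-ring-builder operation along these lines. Device names and weights below are made up, and the actual change went in through puppet:

    swift-ring-builder container.builder set_weight d10 300   # SSD-backed device: weight up
    swift-ring-builder container.builder set_weight d2 100    # spinning disk: weight down
    swift-ring-builder container.builder rebalance
    # then distribute the regenerated container.ring.gz to the proxy/storage nodes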
[21:39:51] RECOVERY - Puppet freshness on ms-be4 is OK: puppet ran at Mon Jul 2 21:39:38 UTC 2012 [21:40:18] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:40:27] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:41:19] maplebed: what kind of ssd's are you using? [21:41:21] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 28%, RTA = 0.55 ms [21:41:22] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:34] maplebed: we are going to pull a bunch of them from the mc hosts [21:41:57] uh oh, hey guys [21:41:57] binasher: I'm not actually sure. RobH should be able to tell you. They're intel 160G, if that helps. [21:42:02] can someone help out with emery? [21:42:05] it looks borked for a few days [21:42:09] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=emery.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [21:42:12] can't ssh in [21:42:21] E3 team just noticed and told me [21:43:05] LeslieCarr, you got a sec to poke it with a stick? or maybe someone can tell me if I can poke it (is there a web console interface or sumpin?) [21:43:27] maplebed: that sounds like the x25-2's that are in the cp and mc hosts.. i suppose they've already been ordered but there are 16 that could be pulled from mc hosts and used immediately if you need them [21:43:36] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:43:54] binasher: we didn't order any more. I've been told there were sufficient at the colo already to populate the new hosts. [21:43:54] PROBLEM - LVS HTTPS IPv4 on upload.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:00] the holdup isn't the disks but the person to install the disks. [21:44:01] :( [21:44:12] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60499 bytes in 1.027 seconds [21:44:15] ah, ok [21:44:30] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:44:34] (stoopid vacations. why do we let people take vacation?) [21:44:50] ottomata: let me try and connect to the management interface. [21:44:57] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43453 bytes in 3.010 seconds [21:45:15] RECOVERY - LVS HTTPS IPv4 on upload.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 587 bytes in 1.885 seconds [21:45:15] thanks maplebed [21:45:42] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:42] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:42] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:42] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:57] huh. All I see on console is: [21:45:57] [1948810.593379] [21:46:00] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 61%, RTA = 0.23 ms [21:46:04] mean anything to anyone? 
[21:46:22] ha, not to me [21:47:57] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:48:33] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [21:48:43] !log rebooted emery - it's been unresponsive for 3 days. [21:48:51] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:48:52] Logged the message, Master [21:49:00] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:49:09] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:49:11] maplebed: http://paste.debian.net/177473/ [21:49:55] wfm. [21:50:03] are you in the EU? [21:50:06] yea [21:50:11] lots of packet loss to the squid's public ips aka 208.80.152.66 [21:50:27] RECOVERY - Host emery is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [21:50:36] for reference I noticed it getting slaggier and slaggier gradually [21:50:36] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:46] mark, LeslieCarr ^^^^ [21:50:54] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:51:02] is that a side effect of what you're working on? [21:51:03] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:51:03] PROBLEM - NTP on emery is CRITICAL: NTP CRITICAL: Offset unknown [21:51:03] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/arabic-banner.log, /var/log/squid/teahouse.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [21:51:21] RECOVERY - check_minfraud_primary on payments3 is OK: OK [21:51:21] RECOVERY - check_minfraud_secondary on payments3 is OK: OK [21:51:39] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:52:10] AzaToth: so I guess that's an expected side effect of network troubles that are currently being diagnosed. [21:52:15] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:21] ok [21:52:24] PROBLEM - Host capella is DOWN: PING CRITICAL - Packet loss = 100% [21:52:24] (nice picture though.) 
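The paste AzaToth links above is a per-hop loss report; something like the following is the usual way to reproduce that kind of measurement from an affected vantage point (target IP taken from the discussion, tool choice an assumption):

    mtr --report --report-cycles 100 208.80.152.66   # per-hop loss and latency
    ping -c 100 208.80.152.66                        # overall loss percentage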
[21:53:00] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 28%, RTA = 0.22 ms [21:53:09] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:53:09] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:53:17] unexpected side effect [21:53:33] maplebed: I was trying to evaluate it for Quality Images on commons, but only got 30% of it [21:53:36] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39248 bytes in 0.927 seconds [21:53:45] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:53:58] mark: I see [21:54:03] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:54:12] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:54:21] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:54:21] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60495 bytes in 0.832 seconds [21:54:39] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49572 bytes in 0.735 seconds [21:54:48] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39240 bytes in 0.687 seconds [21:55:15] RECOVERY - check_minfraud_secondary on payments2 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.415 second response time [21:55:15] RECOVERY - check_minfraud_secondary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.421 second response time [21:55:15] RECOVERY - check_minfraud_primary on payments2 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.141 second response time [21:55:15] RECOVERY - check_minfraud_primary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.148 second response time [21:55:15] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:55:42] RECOVERY - NTP on emery is OK: NTP OK: Offset -0.04470491409 secs [21:57:30] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [21:58:51] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [21:59:34] en wiki seems somewhat broken - Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 21:59:19 GMT [22:01:33] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [22:01:51] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [22:02:00] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [22:02:18] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [22:02:18] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [22:02:45] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [22:02:45] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [22:02:45] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: 
CRITICAL: host 208.80.152.196, sessions up: 3, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [22:03:30] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49571 bytes in 1.985 seconds [22:03:39] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60502 bytes in 0.946 seconds [22:03:39] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39238 bytes in 0.790 seconds [22:04:06] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39250 bytes in 0.940 seconds [22:04:06] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49578 bytes in 1.049 seconds [22:04:15] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [22:04:42] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60495 bytes in 0.814 seconds [22:09:12] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:13:42] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [22:30:30] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [22:30:30] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [22:32:27] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [22:33:30] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [22:33:41] !log rebooted db36 for kernel upgrade [22:33:50] Logged the message, Master [22:34:24] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [22:34:24] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [22:35:27] PROBLEM - Host db36 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:24] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] RECOVERY - Host db36 is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [22:38:27] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [22:38:27] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [22:39:21] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [22:40:24] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [22:40:24] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [22:40:24] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [22:41:09] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 803 seconds [22:41:36] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 814 seconds [22:42:30] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [22:44:27] PROBLEM - 
Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [22:45:21] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [22:46:06] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:24] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [22:46:24] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [22:47:27] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [22:48:30] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [22:49:24] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [22:49:24] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [22:51:21] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [22:53:27] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [22:53:27] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [22:53:27] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [22:54:21] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [22:56:27] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [22:59:27] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [23:02:09] PROBLEM - NTP on db38 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:09] PROBLEM - NTP on db43 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:09] PROBLEM - NTP on db39 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:18] PROBLEM - NTP on db44 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:27] PROBLEM - NTP on db37 is CRITICAL: NTP CRITICAL: Offset unknown [23:02:27] PROBLEM - NTP on db51 is CRITICAL: NTP CRITICAL: Offset unknown [23:02:36] PROBLEM - NTP on db45 is CRITICAL: NTP CRITICAL: No response from NTP server [23:03:39] PROBLEM - NTP on db1039 is CRITICAL: NTP CRITICAL: Offset unknown [23:03:57] RECOVERY - NTP on db37 is OK: NTP OK: Offset 1.323223114e-05 secs [23:03:57] RECOVERY - NTP on db51 is OK: NTP OK: Offset -1.47819519e-05 secs [23:03:57] PROBLEM - NTP on db1007 is CRITICAL: NTP CRITICAL: Offset unknown [23:06:47] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:07:40] New patchset: preilly; "add corrected range(s) for saudi telecom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14004 [23:08:10] Ryan_Lane: can you approve this for me ^^ [23:08:12] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14004 [23:08:22] lemme see [23:08:37] Ryan_Lane: okay thanks [23:08:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14004 [23:08:53] RECOVERY - NTP on db1007 is OK: NTP OK: Offset 0.00180721283 secs [23:08:58] Ryan_Lane: thanks again [23:09:07] yw [23:16:59] RECOVERY - NTP on db1039 is OK: NTP OK: Offset 0.0008271932602 secs [23:19:50] RECOVERY - NTP on db39 is OK: NTP OK: Offset 0.0002022981644 secs [23:20:26] RECOVERY - NTP on db44 is OK: NTP OK: Offset 0.001883864403 secs [23:20:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [23:21:02] RECOVERY - NTP on db45 is OK: NTP OK: Offset 0.001336693764 secs [23:22:32] RECOVERY - NTP on db43 is OK: NTP OK: Offset 0.002466797829 secs [23:22:32] RECOVERY - NTP on db38 is OK: NTP OK: Offset 0.002821087837 secs [23:33:29] wtf is going on??? [23:33:48] I'm in Nicaragua, just sat on my laptop after a day and a half or so [23:33:53] I've gotten like 100 pages [23:34:45] which part would you like to know? [23:37:22] it's still paging [23:37:29] right now [23:37:35] so, I guess everything is /not/ ok? [23:38:34] AzaToth: so I guess that's an expected side effect of network troubles that are currently being diagnosed. [23:38:44] in response to AzaToth asking about the flapping [23:39:06] unexpected side effect [23:40:05] but it looks like it was back to normal about 1.5 hours ago [23:42:56] I'm getting pages right now [23:43:08] so, it probably wasn't solved [23:43:17] or it's a different issue [23:43:53] old messages i suspect paravoid [23:44:03] oh, could be [23:44:06] just catching up with you [23:44:06] let me check [23:44:46] being in a foreign timezone doesn't help with interpreting UTC times in SMSes [23:45:11] hm, indeed [23:45:18] old queued up messages it is [23:45:49] thanks CT [23:46:56] thank leslie andmark [23:47:57] writing the followup now [23:48:27] sorry about being not very responsive via irc, i was doing a screen sharing meeting with vendors + poking frantically at the routers [23:52:01] LeslieCarr: fwiw, I didn't hack the commit log :) [23:52:09] hahaha [23:52:12] ;) [23:52:12] and haven't committed anything [23:52:19] and it was definitely discard last week [23:52:28] really ? [23:52:32] how'd it become reject [23:52:33] definitely = I could see packets being discarded [23:52:37] this is puzzling [23:52:38] not getting rejects [23:52:47] asked mark, he said that he confirmed it being discard on the junipers [23:52:48] hrm... [23:52:51] yeah [23:52:52] I didn't check myself [23:53:14] so strange [23:53:22] it's possible there were lots of fallible memories [23:53:44] I'm 100% sure that when I tried pinging virt6/7/8 from e.g. sockpuppet [23:53:54] I wouldn't get a reply [23:54:03] (either echo reply or admin prohibited) [23:54:41] * Damianz gives LeslieCarr a cookie for fixing a little part of the internet [23:54:52] yay cookies [23:57:53] * maplebed did that too! [23:57:55] LeslieCarr: just read your mail. wow [23:58:10] hehe, maplebed did give me a cookie :) [23:58:19] it gave me the power to keep talking with vendor support [23:58:34] speaking of that, i never really ate lunch, going to find a quick snack
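On the discard-versus-reject confusion at the end here: the two Juniper filter actions are distinguishable from the outside, which is the quickest way to check which one is really configured. A sketch from a host like sockpuppet (hostnames taken from the discussion):

    ping -c 3 virt6
    # filter action "discard": probes are dropped silently -> 100% packet loss
    # filter action "reject":  each probe gets an ICMP destination-unreachable
    #                          (communication administratively prohibited) back --
    #                          the "admin prohibited" paravoid mentions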