[00:00:11] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [00:00:11] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 17 seconds [00:00:38] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 15 seconds [00:00:38] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 1 seconds [00:00:38] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 1 seconds [00:00:56] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [00:01:59] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [00:02:17] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [00:02:17] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 18 seconds [00:03:02] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 25 seconds [00:03:47] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 13 seconds [00:04:23] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [00:05:44] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 4 seconds [00:07:41] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [00:07:50] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [00:09:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [00:15:02] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.005 second response time [00:15:56] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.011 second response time [00:24:56] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:32] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:51] New patchset: Tim Starling; "Reduce db32 read load to zero due to persistent lag" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13853 [00:35:18] New review: Tim Starling; "Already live." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13853 [00:35:20] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13853 [00:37:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:38:35] PROBLEM - Host mw1093 is DOWN: PING CRITICAL - Packet loss = 100% [00:46:06] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [00:46:23] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.026 second response time [00:47:17] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.012 second response time [00:47:26] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [00:50:17] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [00:50:53] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:51:38] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [00:51:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [00:51:56] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [01:02:36] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [01:05:26] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [01:06:56] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [01:12:38] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [01:18:56] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.012 second response time [01:19:14] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.014 second response time [01:23:35] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [01:32:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:41] PROBLEM - Host mw1095 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [01:41:35] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [01:42:47] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 258 seconds [01:42:56] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [01:46:05] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [01:47:26] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [01:48:38] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 607s [01:51:29] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:52:59] RECOVERY - Misc_Db_Lag on 
storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 34s [01:53:17] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 6 seconds [01:56:12] Hey wikimedia people. I'm with the company (your.org) that's been working with apergos on setting up the full off-site mirror of everything (http://ftpmirror.your.org and http://dumps.wikimedia.your.org). Today we got an automated notice from Google that both of those URLs were serving malware. Their lack of specificity is making this really difficult to narrow down, so before we tear everything apart, does anyone know if th [01:56:12] images/commons files are being scanned for anything nasty? Or is there any way that you guys could have flagged something as deleted but it's still being pushed to us? [01:58:51] I'm running clamav on the entire set of everything being mirrored. The only thing it's complained about so far is http://ftpmirror.your.org/pub/wikimedia/images/wiktionary/fj/c/c4/citibank-car-loan.pdf but nothing else is triggering on that file [02:07:05] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:08:26] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [02:10:50] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:11:36] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:11] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.013 second response time [02:21:56] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.010 second response time [02:28:32] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [02:28:32] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [02:30:38] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [02:31:32] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [02:32:35] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [02:32:35] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [02:35:35] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [02:36:38] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [02:36:38] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [02:37:32] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [02:38:44] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [02:40:14] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:40:32] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has 
not run in the last 10 hours [02:42:38] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [02:43:32] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [02:44:35] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [02:44:35] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [02:45:38] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [02:46:32] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [02:47:35] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [02:47:36] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [02:49:32] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [02:51:38] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [02:51:38] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [02:51:38] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [02:52:33] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [02:53:35] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [02:54:56] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 [02:57:38] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [02:57:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:00:35] Malware in Your.org o_O [03:07:06] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Mon Jul 2 03:06:52 UTC 2012 [03:15:38] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Jul 2 03:15:28 UTC 2012 [03:15:56] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [03:17:17] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [03:17:47] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:26:47] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:28:08] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:36:05] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:36:23] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:56] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.008 second response time [03:39:14] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.007 second response time [03:53:47] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:56:47] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [03:58:08] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 
208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [04:53:15] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 6, down: 2, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [04:56:15] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [05:09:20] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [05:13:23] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [05:56:08] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:29:25] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [06:30:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:22:26] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 185 seconds [07:22:26] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 185 seconds [07:30:51] hello [07:31:50] yo [07:31:53] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [07:33:14] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [07:33:53] apergos: the labs went wild this weekend due to the leap sec bug [07:34:00] I heard [07:34:03] I was mostly not around [07:34:05] I had some funny bugs like mysql spiking to 100% cpu [07:34:12] same with automount [07:34:15] I had that on my laptop, restarted and it was fine [07:34:27] prod was fine from what I heard, besides java search I think [07:34:28] (mysql). also some java thing, oh yeah it was tomcat [07:48:41] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [07:51:32] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:00:24] New patchset: Hashar; "(bug 37457) viwikibooks: fix import sources" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13860 [08:01:06] back in a little bit, errands before it gets blisteringly hot [08:01:17] New review: Hashar; "Follow up : https://gerrit.wikimedia.org/r/13860" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [08:01:39] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13860 [08:02:19] ohh Jenkins has troubles [08:02:21] :-D [08:02:37] yeah ksoftIRQ madness [08:02:52] apergos: I guess we need to reboot the gallium host :-D [08:03:40] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [08:05:28] is it "normal" for a lightly loaded server to do like 800k / 1M context switches per second? (according to vmstat) [08:06:31] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [08:06:33] that's on the high side [08:06:34] :) [08:07:10] that's a lot, but it may be normal [08:07:22] whenever a thread blocks, it wants to run another one [08:07:43] now, the question is what overhead do context switches have on that server?
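For reference, the numbers being asked about here can be pulled with stock tools, and the per-process view shows who is doing the switching (the ksoftirqd symptom mentioned above); a minimal sketch, with the sampling interval chosen arbitrarily:

    # system-wide: the "cs" column is context switches per second
    vmstat 1 5
    # per-process (sysstat package): voluntary (cswch/s) vs involuntary (nvcswch/s) switches
    pidstat -w 1 5
    # one-shot look at where CPU time is going
    top -b -n 1 | head -20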
[08:07:59] I have no idea how to check that [08:08:26] user / system are at 30% / 30% each, top -i just show ksoftirqd [08:08:47] platonides: thats quite a lot of thrashing, I'm used to busy DBs under 200k ;-) [08:09:58] hashar, don't worry, I wasn't expecting you to magically give out a number [08:10:18] the box does nothing :-] [08:10:49] all the cpu usage is overhead? [08:10:57] I guess so [08:10:57] that doesn't look too efficient :) [08:11:20] apergos: would you mind rebooting gallium for me please ? https://rt.wikimedia.org/Ticket/Display.html?id=3208 [08:11:47] !log Stopped Jenkins on gallium. It is not doing anything anyway. Asked to reboot box {{rt|3208}} [08:11:58] Logged the message, Master [08:16:27] New review: Hashar; "deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13860 [08:20:39] ah puppet is sooo smart :-] [08:20:44] it restarted jenkins [08:20:45] \O/ [08:26:19] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [08:26:19] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [08:39:22] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:04:49] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [09:06:10] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [09:33:40] hashar, do you still need that reboot? [09:35:30] apergos: gallium ? yes :-) [09:35:45] restarting the java app is not enough apparently [09:35:46] :( [09:36:42] I restarted Jenkins on gallium but CPU is still high http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&c=Miscellaneous+eqiad&h=gallium.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [09:36:53] so definitely ksoftirq eating everything [09:37:50] ok [09:38:51] !Log rebooting gallium, it's pretty unhappy (maybe related to leap second issue) [09:38:54] grrr [09:39:04] !log rebooting gallium, it's pretty unhappy (maybe related to leap second issue) [09:39:15] Logged the message, Master [09:39:38] It's a long time after the leap second change but it is running java [09:40:31] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:41:38] let's see how that goes [09:42:19] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [09:43:40] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [10:10:34] apergos: works for me thanks for rebooting gallium :-] [10:10:43] sure [10:10:45] should have asked you to upgrade packages while you were at it but that will be for the next time :-] [10:31:03] PROBLEM - NTP on db12 is CRITICAL: NTP CRITICAL: Offset -1.042744517 secs [10:36:00] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [10:37:21] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [10:44:33] RECOVERY - NTP on db12 is OK: NTP OK: Offset 0.004649877548 secs [10:53:38] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [10:55:44] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours [10:57:29] !log Problems on one of two pmtpa-eqiad waves; raised OSPF metric to 60 to failover traffic to the other link [10:57:40] Logged the message, Master [10:57:42] PROBLEM - Puppet 
freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [11:03:23] PROBLEM - Host mw1025 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:13:44] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [11:14:20] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [11:20:20] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: Device does not support ifTable - try without -I option [11:24:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [11:24:50] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:27:50] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:29:11] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [11:29:20] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [11:29:47] PROBLEM - Puppet freshness on tarin is CRITICAL: Puppet has not run in the last 10 hours [11:31:31] !log Now we have packet loss within pmtpa/sdtpa... reverting change [11:31:41] Logged the message, Master [11:31:53] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [11:34:55] the replag issues are known ? [11:36:48] TimStarling: Tim, can you do a little spam control on scribunto.wmflabs or give others who know how to do that some rights ? [11:41:03] there's some weird network issue going on [11:57:42] djhartman: I gave you bureacrat access [12:21:19] New patchset: Hashar; "Add a symbolic link to CREDITS for Change Ia02c3bcf." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [12:21:52] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [12:24:00] New review: Hashar; "deployed live" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [12:26:34] !log upgrading kernel on gallium [12:26:44] Logged the message, Master [12:27:25] PROBLEM - Host mw1044 is DOWN: PING CRITICAL - Packet loss = 100% [12:27:48] !log rebooting gallium one more time to install kernel [12:27:58] Logged the message, Master [12:29:13] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [12:29:13] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [12:31:10] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [12:32:13] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [12:33:16] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [12:33:16] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [12:35:20] !log installing upgrades on fenari (linux-firmware linux-libc-dev..) 
[12:35:31] Logged the message, Master [12:36:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:36:16] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [12:36:16] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [12:36:16] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [12:37:10] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [12:37:10] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [12:38:13] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [12:39:16] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [12:39:16] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [12:39:16] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [12:41:13] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [12:43:10] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [12:44:13] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [12:44:49] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [12:45:16] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [12:45:16] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [12:45:39] pfff [12:45:52] mutante: seems db12 has some replag [12:46:06] I have no idea where to check it though [12:46:10] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [12:46:10] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:47:13] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [12:48:16] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [12:48:16] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [12:48:18] @replag db12 [12:48:18] Damianz: [db12: s1] db12: 1252s [12:49:54] well [12:50:13] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [12:50:38] apergos, how did you diagnose it last time? [12:50:49] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [12:51:54] how did I diagnose which? 
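As an aside to the @replag exchange above, the same lag can be read straight off the slave itself; a minimal sketch, assuming direct MySQL access to the host named in the discussion:

    mysql -h db12 -e 'SHOW SLAVE STATUS\G' \
        | egrep 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'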
[12:51:54] ahhh http://noc.wikimedia.org/dbtree/ [12:51:55] ;) [12:52:10] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:52:10] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [12:52:10] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [12:52:10] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [12:52:14] someone else noticed it (and yes, that's an easy way to monitor them all) [12:52:27] the only thing I saw going on over there was the pupulatesha1 script [12:52:34] *oopulate [12:52:49] yup [12:52:55] it obviously doesn't pay attention to slave lag too much [12:53:13] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [12:53:44] what are the mysqldump running on db12 ? https://ishmael.wikimedia.org/more.php?host=db12&hours=24&checksum=7467891370387641567 [12:54:11] apergos, i mean when you blocked some ip requesting contribs en masse [12:55:08] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [12:56:14] hashar: DELETE /* LinksUpdate::incrTableUpdate Traveler100 */ FROM `templatelinks` ? [12:56:32] some jobrunner [12:56:33] sampled 1000 logs iirc [12:56:59] domas: the populateSha1 does wfWaitForSlave [12:57:14] well it isn't [12:57:28] should wait after each batch [12:57:39] how large is a batch? [12:58:07] that is the question ;) maybe it was made --batch=1 [12:58:13] obviously not enough [12:58:16] --batch = 100000000 [12:58:17] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [12:58:36] :-/ [12:59:01] bah ou [12:59:24] ahh populateRevisionSha1.php [12:59:28] running on hume [12:59:36] I was looking at another populate script [12:59:47] (that is on hume) [13:01:54] grgarougaema ze [13:02:02] they are running in screen sessions [13:02:05] with no log files [13:02:44] shall i tell VP/T that you guys are investigating the replag issue and point folks to server admin log ? [13:03:06] apergos: mutante the scripts have been running since June 29th. Not sure if they are the cause [13:03:55] they were already stopped once to let a slave catch up [13:03:59] see the server admin log [13:04:04] ohh [13:05:04] yeah, about 12 hours ago [13:06:45] so there is 4 occurrences of the script [13:06:49] each doing batches of 1000 [13:06:54] entris [13:07:13] so that is like 4k UPDATE [13:07:23] then wfWaitForSlaves [13:07:33] it's not that much data [13:07:35] * hashar kill -STOP aaron [13:09:30] hashar, he's in crontabs;) [13:11:35] is everywhere else done bar enwiki? [13:11:47] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 [13:12:15] Reedy: no idea sorry [13:12:48] needs faker :) [13:12:49] isn't that mysqldump overloading it --> https://ishmael.wikimedia.org/?host=db12 ? [13:13:05] that is what I was wondering [13:13:08] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [13:13:40] 2012-07-02 12:46:47 average time: 1938s mysqldump [13:13:48] it is not overloaded, btw [13:13:58] hehe [13:14:08] it's just being special and slow [13:14:15] it would need to have about 10x load to become overloaded for me :) [13:17:06] well, lag = not handling replication. 
sort of overloaded:) [13:19:17] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:21:19] hashar / reedy: you happen to know if this runs somewhere in cron already or is still needed in a cron? "maintenance/purgeParserCache.php" [13:21:30] well db12 is full of waiting for slave queries anyway [13:22:00] mutante: it should be still needed, we're still using db40 for mysql parser cache [13:22:08] as for if it's anywhere, no idea. Tim/Asher would know more [13:22:46] enwiki 0 | Creating tmp table | (SELECT /* SpecialRecentchangeslinked::doMainQuery XXXX */ `recentchanges`.*,ts_tags,fp_sta | [13:22:52] Reedy: yea, i see in a ticket that Tim once worked on it and ran it, but then it was open with the question "does it still need to be added to cron" [13:23:01] can't we shift load out of db12 ? [13:23:57] PROBLEM - Host mw1105 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:07] hashar: db12 is used for watchlist/rc etc [13:24:09] Reedy: i am guessing "on hume", but i wouldn't know which parameters and how often, gonna ask Asher. thx [13:24:11] oh db12 already down to 0 [13:24:43] maybe we have made a change in mw that introduce to much stress ? [13:24:52] I have seen a lot of "copying to tmp table" [13:26:02] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [13:26:05] hashar: i dunno but the "DELETE /* LinksUpdate::incrTableUpdate Traveler100.." is gone now [13:27:12] 1173 | Sending data | SELECT /*!40001 SQL_NO_CACHE */ * FROM `templatelinks` [13:27:23] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [13:28:07] on enwiki [13:29:33] well that select is just about 327Million rows [13:30:07] must be form mysqldump indeed [13:30:46] hi, need somebody from India or Indian language knowledge, especially "Assamese" to confirm if a bug is resolved. [13:31:16] i think i should try that on #wikipedia, heh [13:31:52] mutante: I know of santosh and yuvipanda (not sure they know Assasmese though) both offline [13:31:54] That sounds rather very much like vandalism lol [13:34:14] Damianz: heh, it was related to BZ 33507, but nobody reopened that.. so looks resolved [13:34:48] New patchset: Demon; "Various tweaks to gerrit.pp to get it running in labs:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [13:35:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [13:36:26] !log db12 suffering some 1400sec (and growing) replag. mysqldump in progress on that host. [13:36:36] Logged the message, Master [13:38:27] hashar: which servers would need "librsvg" to render svg? [13:38:42] image scalers at least [13:39:18] SVG rendering should be done through the thumbnailing infrastructure [13:39:59] alright, i see "librsvg2-bin" in imagescaler.pp [13:40:01] mutante: imagescaler::packages . Why do you ask? [13:40:13] 'db32' => 0, # snapshot host [13:40:14] because you asked about the version of it in some RT a few months ago [13:40:18] and it is still open [13:40:26] Reedy: local hack ? 
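The ishmael links above are one view of what db12 is busy with; the same question can be asked of the server directly. A sketch, assuming MySQL access to the host and with the 60-second threshold picked arbitrarily:

    mysql -h db12 -e "
        SELECT id, user, time, state, LEFT(info, 80) AS query
        FROM information_schema.processlist
        WHERE command <> 'Sleep' AND time > 60
        ORDER BY time DESC LIMIT 15;"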
[13:40:36] lol, it's a comment in db.php [13:40:41] apergos would know [13:40:46] hashar: rsvg --version ;) [13:40:48] Reedy: the dump comes from snapshot4 yes [13:41:11] "The version in use is 2.26.3 as a WMF package" [13:41:15] *** 2.26.3-0wm1 0 [13:41:18] ..that was a while ago [13:41:34] lucid chips 2.26.3 too [13:41:36] "Our bug tracker has several bugs which might be fixed by upgrading that tool" [13:41:38] i am not sure what is in -wm1 [13:41:43] hmmm [13:41:58] https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/librsvg/debian/ [13:41:59] i hope wm1 are the fixes :) [13:42:11] I am pretty sure not :) [13:42:12] I am sorry to say it has been a long time since I looked at any of that [13:42:18] https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/librsvg/debian/changelog?view=markup [13:42:19] so what version and where... no idea [13:42:25] oh there are patches :-) [13:42:45] apergos: I wasn't actually pinging about the imagescalers. Just the host being used for the enwiki dumps [13:43:13] oh [13:43:16] mutante: the patch that matter is wikimedia-brand.patch [13:43:20] db32 is commented as the "snapshot host", but seems db12 is being used, aswell as for watchlists, rc etc [13:43:23] what about it, sorry? [13:43:39] 'dump' => array( [13:43:39] 'db12' => 1, [13:43:46] yeah it is dumping tables. these dumps hold no locks, [13:43:46] lying comment lies [13:44:08] which may not be awesome for the users of the data but [13:44:19] it means they should have minimal impact on the dbs [13:44:24] mmm [13:44:37] shame we can't use the idle eqiad slaves [13:44:49] well at some point maybe we will [13:44:55] (have an idle slave that gets used for them) [13:44:59] but in the meantime, ... [13:45:11] yeah, lots of cross dc traffic isn't good [13:45:12] New patchset: Demon; "Various tweaks to gerrit.pp to get it running in labs:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13484 [13:45:20] well db12 as a nice "select * from `templatelinks`" query sending data 327M rows [13:45:36] yep [13:45:43] because it's dumping the whole table [13:45:48] New patchset: Dzahn; "decommission db10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13881 [13:45:49] that's what it does [13:45:49] adding to db10 de-confusion [13:46:10] could it be making db12 lagging ? [13:46:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13484 [13:46:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13881 [13:46:27] New review: Dzahn; " hm db10 should be decommissioned" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/13881 [13:46:50] mutante: for rsvg I think you want to talk about it with Tim. Might need to backport librsvg from Precise or something similar [13:46:58] it's slow but other things run and complete without it having been an issue, for oh a couple years now [13:47:09] (since we changed the lock options on the mysql dump invocation) [13:47:10] mutante: we would want to migrate the imagescaler to Precise anyway [13:47:40] hashar: yeah, thats what i thought, agree [13:48:10] mutante: you will want to talk about Precise upgrade during op meeting. 
Going to be a lot of fun :-) [13:48:54] mutante: like 6 new boxes to setup :) [13:51:49] apergos: well the lag started growing up at 6am UTC, looks like something is wrong :-/ [13:52:02] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=MySQL+pmtpa&h=db12.pmtpa.wmnet&v=1517&m=mysql_slave_lag&jr=&js=&vl=secs&ti=mysql_slave_lag [13:52:11] hashar: heh,sticking to RT for now, but creating a new one to do precise upgrade [13:52:20] Like I say, it happened about 13 hours ago [13:52:33] mutante: you will want to talk about doing precise upgrades during ops meeting [13:53:47] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [13:55:08] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [13:59:24] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:00:08] hashar: aha, i found more on rsvg elsewhere "the oneric version of rsvg needs to be backported to lucid. " per hexmode. linking [14:00:18] !! [14:00:18] an awesome bash trick [14:00:29] !del !! [14:00:33] finally [14:00:35] @del !! [14:00:46] !? [14:00:47] https://bugs.launchpad.net/ubuntu/+source/librsvg/+bug/921897 [14:00:51] hashar: are you doing the rsvg thing? [14:00:54] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [14:01:01] hexmode: no just talking about it [14:01:16] I am pretty sure it will only be upgraded when the imagescalers move to Ubuntu Precise [14:01:52] ^ [14:01:53] indeed [14:01:57] hexmode, hashar: RT-2548 and RT-2585 are the tickets of each other :p [14:02:15] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [14:02:19] and yes, dependency for precise upgrade [14:03:36] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [14:16:26] New review: Hashar; "some files only gain 1 bytes such as the gnu-fdl.png . They also have a text comment which could be ..." 
[operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/13285 [14:26:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13316 [14:28:57] New patchset: Hashar; "detect cluster with /etc/wikimedia-realm" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13888 [14:34:15] !log Shutdown BGP session to 2828 on cr1-sdtpa [14:34:25] Logged the message, Master [14:37:54] !log Shutdown PyBal BGP sessions on cr1-sdtpa [14:38:03] Logged the message, Master [14:39:54] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:41:48] !log Rebooting cr1-sdtpa [14:41:57] Logged the message, Master [14:42:45] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:43:48] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [14:43:48] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [14:43:48] PROBLEM - Host cr1-sdtpa is DOWN: CRITICAL - Network Unreachable (208.80.152.196) [14:43:48] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [14:43:57] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [14:43:57] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [14:44:06] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [14:44:06] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [14:44:06] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [14:44:06] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [14:44:06] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [14:44:07] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [14:44:15] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [14:44:33] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [14:44:33] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [14:44:42] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [14:44:42] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [14:44:51] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [14:45:55] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:46:21] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.68 ms [14:46:21] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [14:46:21] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [14:46:21] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [14:46:21] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.32 ms [14:46:21] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms [14:46:22] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms [14:46:23] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.30 ms [14:46:23] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.39 ms [14:46:24] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 5.14 ms [14:46:30] RECOVERY - Host ps1-b5-sdtpa is UP: PING 
OK - Packet loss = 0%, RTA = 2.67 ms [14:46:39] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [14:46:48] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [14:46:57] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms [14:47:15] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:47:33] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.42 ms [14:47:33] RECOVERY - Host cr1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [14:48:45] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:49:39] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.84 ms [14:50:06] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:50:25] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [14:50:42] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:51:45] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 64%, RTA = 0.45 ms [14:53:33] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 3, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [14:53:42] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 73%, RTA = 0.24 ms [14:54:27] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [14:55:12] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:55:21] !log Reboot of cr1-sdtpa did not fix the RE packet loss issue... therefore unlikely to be leap second related [14:55:31] Logged the message, Master [14:56:33] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [14:56:42] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [14:58:54] New review: Dzahn; "abandon? close RT-3078?" [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9874 [15:01:12] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:01:12] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:04:16] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [15:09:40] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 4, down: 2, shutdown: 0BRPeering with AS14907 not established - The + flag cannot be used with the sub-query features described below.BRPeering with AS64600 not established - BR [15:10:07] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [15:10:35] New review: Jeremyb; "I have no idea what RT 3078." 
[operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [15:12:40] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:14:10] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [15:17:19] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:18:58] PROBLEM - Host mw1096 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:24] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13888 [15:21:40] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:25:10] New review: Hashar; "That change broke enwiki, and possibly over wikis, by making the resourloader drop the 'href' elemen..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13322 [15:26:28] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:29:22] since the dumps just started on the next table, I interrupted them [15:29:27] (for en wiki) [15:30:06] I still don't see how the dumps by themselves can be to blame, since they are running the same stuff they have been for months and months, but I don't know what to look at over there [15:30:12] on db12 [15:32:14] * jeremyb waves mutante ;) [15:32:47] at this point the replag is 2087. let's see what it is in ten minutes [15:33:01] anyone want to test bug 31369 with me? ;) redirects.conf [15:34:09] jeremyb: sure [15:35:02] Jeff_Green: so, either deploy everywhere and i'll test or deploy to a specific backend and find a way for me to get there? i only have access to labs [15:35:19] Jeff_Green: or i give you a bash script with curls to run [15:35:20] there's a staging box we can use [15:35:24] sec [15:35:46] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [15:35:56] srv193 [15:36:26] jeremyb: jobs.wm better for you? ack? [15:36:35] mutante: i mailed [15:37:07] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [15:37:17] i see. thx [15:37:40] mutante: and updated the bug with a link to 2402 [15:38:10] great. did the same in RT [15:38:55] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:40:11] db12 replag is now 2133 [15:40:15] jeremyb: yeah, you can do that in gerrit, but just note that currently it needs git and also still svn, unfortunately [15:40:16] no dump process. [15:40:25] I think we can rule that out as a cause. [15:40:39] mutante: err, huh? [15:40:58] jeremyb: if somebody merges in gerrit it will not be deployed to cluster automagically yet [15:40:59] mutante: what's svn? [15:41:00] Reedy: hashar: ^^ [15:41:22] mutante: well sure. same with puppet and mediawiki and initialisesettings and...? [15:42:07] apergos: just over half an hour? bleh [15:42:11] jeremyb: hmm, not quite the same. apache config is still in subversion [15:42:17] have we got some stupid rc/watchlist queries running [15:42:18] * Reedy looks [15:42:40] mutante: errr, ugh?! 
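For the redirects.conf test being arranged earlier in this exchange, one low-risk approach is to point curl at the staging box via the Host header rather than touching DNS; a sketch, with the backend FQDN assumed for illustration (the log only names it srv193):

    curl -sI -H 'Host: jobs.wikimedia.org' http://srv193.pmtpa.wmnet/ \
        | egrep -i '^(HTTP|Location)'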
[15:42:49] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:42:50] well so the point is this has been going on much of today [15:42:57] mutante: Jeff_Green: back in a bit [15:43:03] k [15:43:09] the only thing that's been steady today (I think) is the populate sha1 stuff [15:43:15] That's a lot of slave waiting [15:43:25] yes it is [15:43:44] if it had actually been the dumps process I would have expected the numbers to drop slowly once I shot it [15:43:45] but nope [15:43:56] and apparently that's all the machine is doing [15:44:07] (2151 now) [15:44:27] uh huh [15:44:57] with the odd contribs query [15:45:34] maybe it's worth another kill -CONT on populateRevisionSha1.php for a little while? [15:45:40] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:52] er [15:46:04] -STOP , sorry [15:46:07] Might be worth it [15:46:12] New patchset: Hashar; "DO NOT SUBMIT (bug 37245) makes labs use bits.beta.wmflabs.org DO NOT SUBMIT" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13892 [15:46:17] Not going to make anything worse [15:46:34] RECOVERY - Host wikibooks-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [15:47:22] 8 of em at once, meh [15:48:14] New review: Hashar; "Do not even think about submitting this change cause it will break the wikis horribly sending back t..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/13892 [15:48:49] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [15:50:10] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [15:50:46] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [15:51:25] apergos: back sorry [15:51:49] Reedy: I have seen a few queries doing "copying to temp table". [15:51:59] Indeed [15:52:04] Reedy: though not that many and they were not lasting a long time [15:52:08] Watchlist/RecentChanges etc aren't nice [15:52:09] so I thought it was harmless [15:52:51] I figure those are always happening [15:53:22] so the q is what's actually pushing it past the breaking point [15:53:36] (I haven't stopped those populates yet) [15:54:09] indeed [15:54:23] it's only 4 of them but there's the parent shell too [15:54:23] I can't believe adding some sha1 hashes is taking the machine down [15:54:33] yeah but [15:54:37] there must be more traffic from new revisions [15:54:38] http://wikitech.wikimedia.org/view/Server_admin_log [15:54:53] see here... for db32 [15:55:18] yeah [15:55:30] well it was 2153 and now it's 2127 [15:57:05] Hopefully asher should be about in an hour or so [15:57:13] could something have just completed? but I don't know what it would be [15:59:49] @replag db12 [15:59:50] djhartman: [db12: s1] db12: 2092s [15:59:55] yup, going down [16:00:09] Has anyone looked at the logs on the machine? Just out of interest [16:00:19] me, and I saw nothing helpful [16:00:36] right, just making sure :) [16:00:42] yeah [16:01:47] I sure hope asher can solve it, I need to be able to run my dumps (and sadly I cannot direct them to run on a specific host, as a general rule, they follow what the db load balancer returns) [16:03:26] are the snapshot host comments in db.php a lie then?
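The kill -STOP / -CONT idea floated above can be applied to all of the populateRevisionSha1.php runs at once; a minimal sketch, to be run on the host carrying the screen sessions (hume, per the earlier discussion):

    # pause every instance; they keep their state and the screen sessions stay up
    pgrep -f populateRevisionSha1.php | xargs -r kill -STOP
    # later, once the slave has caught up, let them continue
    pgrep -f populateRevisionSha1.php | xargs -r kill -CONT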
[16:03:47] /legacy [16:04:55] I don't even know about those [16:05:21] I suspect it's a yes then [16:05:22] heh [16:05:32] those mean something else [16:05:45] prolly llvm or something [16:05:55] nothing to do with dumps [16:05:59] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:06:44] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [16:08:37] @replag db12 [16:08:37] djhartman: [db12: s1] db12: 2005s [16:08:55] ok... going down quite a bit. [16:09:44] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 5, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [16:15:53] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:22:33] Jeff_Green: back for at least 30 mins [16:22:45] ok [16:23:00] so one observation: your merge is against a now-outdated redirects.conf [16:23:09] i can rebase [16:23:16] does it not merge cleanly? haven't looked [16:23:21] lemme make sure the live stuff is actually checked in [16:23:23] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [16:23:30] hah [16:23:40] it's in two revision control systems at the moment [16:25:15] ok, git is in sync with svn again . . . [16:26:08] Jeff_Green: did you do a push to gerrit? i'm not seeing it [16:26:29] honestly, I don't know [16:26:33] heh [16:26:54] i don't really understand how git+gerrit are set up in terms of the http conf [16:27:19] oh wait [16:27:20] well if you updated git i don't see what you did [16:27:21] what do you need to know? :) [16:27:42] <^demon|lunch> Jeff_Green: gerrit is available at localhost:8080, apache proxies it out over gerrit.wm.o:80/r/ [16:27:44] <^demon|lunch> :) [16:27:53] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [16:27:56] <^demon|lunch> :443, even [16:28:12] so, after I commit on fenari, do I push for review? [16:28:12] ^demon|lunch: that is so not what he means [16:28:14] or do I push? [16:28:29] how is this supposed to work? [16:28:35] <^demon|lunch> jeremyb: Oh, whoops. [16:28:37] <^demon|lunch> Ignore me. [16:28:40] <^demon|lunch> Going back to my lunch [16:28:45] ^demon|lunch: enjoy! [16:29:40] Jeff_Green: so the problem with push is that no notifications are ever generated [16:29:54] Jeff_Green: you can +2 yourself and then merge immediately [16:29:56] Jeff_Green: i think i know [16:30:19] Jeff_Green: you commit to gerrit for like documentation and review and then you go to fenari as you did before [16:30:32] Jeff_Green: and also commit to svn and use the old sync scripts to push it [16:30:53] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [16:31:21] <^demon|lunch> jeremyb: So yeah, that needs fixing. Badly. [16:32:11] ^demon|lunch: also, we need disposable branches where people can push things so they don't need to use github or gitorious for work on gerrit branches [16:32:33] ^demon|lunch: i can think of a few more things too :P [16:32:33] <^demon|lunch> Totally doable. 
Found a way to do that with permissions last week :) [16:32:34] ^demon|lunch: actually /h/w/conf/httpd is both now, "svn status" and "git status" return stuff :p [16:32:41] ^demon|lunch: whoa [16:32:52] Jeff_Green: you don't commit on fenari [16:33:09] Jeff_Green: you push into gerrit from your local repo, then pull from fenari [16:33:09] you dont? how do you use sync-apache then [16:33:25] <^demon|lunch> jeremyb: https://gerrit.wikimedia.org/r/Documentation/access-control.html#_project_access_control_lists -- see the paragraph starting with "References can have the current user..." [16:33:30] *for all public repos [16:33:38] for the private repos, you just commit [16:33:41] nothing else needed [16:33:43] mutante: Jeff_Green: i thought there was a repo with some stuff gitignored and other stuff svnignored so stuff doesn't go in both repos. maybe this is that repo? [16:34:01] you know, I have no idea anymore [16:34:22] last I knew, it was: make your change locally at fenari:/home/wikipedia/conf/httpd and svn ci stuff [16:34:34] then suddenly it was "omg there's git layered on top of that" [16:34:43] with zero documentation [16:34:46] so . . . [16:35:00] I've done the first step, everything is checked in to svn [16:35:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:35:18] last i heard was that i need to commit to gerrit and also do it the old way [16:35:24] and I did a "git commit -a" to make sure I'd catch everything local that is not in git yet [16:35:31] and that is where I stopped [16:35:34] Jeff_Green: there's always been docs [16:35:53] well, i did it in svn and then hashar updated gerrit with one change that included like the last 3 svn commits to sync [16:35:58] Jeff_Green: http://wikitech.wikimedia.org/view/How_to_deploy_code [16:36:34] it's been there since Nov 2010 ;) [16:36:36] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [16:37:09] Ryan_Lane: I don't really see how taht doc covers this situation [16:37:20] what are you attempting? [16:37:23] apache change? [16:37:32] yep [16:37:35] ag [16:37:36] ah [16:37:37] redirects.conf etc [16:37:41] yeah, that's always been undocumented [16:37:46] :( [16:37:51] so [16:37:55] it's the same as before [16:38:00] make a local git commit [16:38:02] push it out [16:38:10] unless that's been moved to gerrit? [16:38:17] I have no fucking clue [16:38:20] * Ryan_Lane sighs [16:38:20] so how does that get back into gerrit so jeremyb can haxor it [16:38:21] I think hashar created the repo [16:38:21] it is in an in-between state [16:38:26] i talked with hashar [16:38:26] it's just not been properly migrated [16:38:33] in-between states and production make me want to stab [16:38:33] oh right [16:38:36] there is no new deployment script [16:38:40] yet there is a gerrit repo [16:38:42] he was worried about sensitive stuff [16:38:45] and svn is not removed [16:38:51] oh christ [16:39:03] so hashar wants to keep gerrit repo in sync [16:39:05] people can't leave things in inbetween states [16:39:12] but you still need to use svn to actually push stuff to cluster [16:39:14] * andrewbogott may be 5 or so minutes late to ops meeting. 
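Pulling together the workflow described above, the interim apache-config procedure amounts to committing in both systems and still deploying via the old path; a sketch of that sequence (paths and the sync script name taken from the log, the exact gerrit push syntax assumed):

    cd /home/wikipedia/conf/httpd
    svn status && git status             # both VCSes track this tree at the moment
    svn ci -m 'update redirects.conf'    # the copy the old deploy path actually uses
    sync-apache                          # push out to the apaches, as before
    git commit -a -m 'update redirects.conf'
    git push origin HEAD:refs/for/master # keep the gerrit copy in sync / reviewable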
[16:39:17] if you're going to do something, do it all the way [16:39:24] That's what she said [16:39:31] heh [16:39:46] Jeff_Green: well, that cusk [16:39:48] *sucks [16:40:01] that's almost cask [16:40:09] so if you ignore git and just do svn, you will be able to make actual changes, and then it will make hashar commit it in gerrit as well [16:40:31] lovely... [16:40:49] I can see leaving something in this state for a day or so [16:40:55] but I remember seeing that email last week [16:41:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:41:15] i think it's been over a week [16:41:21] * Ryan_Lane sighs [16:42:18] <^demon|lunch> I want to see us ditch the last of SVN too, but leaving things in half-migrated states puts in a bad place. [16:42:25] <^demon|lunch> Where we waste almost half an hour figuring out wtf to do. [16:42:35] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [16:45:17] enwiki is down [16:45:44] it was temporarily [16:45:46] it seems [16:45:52] yeah, for a few sec. [16:45:58] Jasper_Deng: back? [16:46:04] yeah [16:46:06] where's nagios at? [16:46:28] also, i asked a few days ago: how do we get watchmouse notifs in IRC? [16:46:55] where is the watchmouse config (that says what to monitor)? in version control? [16:46:57] write a bot that parses the notification emails [16:47:51] the watchmouse config is changed in a web ui [16:48:16] ugh [16:48:38] http://status.wikimedia.org/feed/8777 seems unuseful [16:48:53] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:49:04] if they had a decent feed with pubsubhubbub that would be great [16:49:22] anyway, how do you subscribe to the emails? [16:52:05] jeremyb: i guess we could subscribe a mailing list [16:52:25] and then use mailman for the subscribers [16:52:34] ok... do they send recovery mail or just broken mail? [16:52:53] depends on each single check config afair [16:53:05] maybe you rather want Nagios? [16:53:15] ..as RSS? [16:53:33] no, watchmouse sometimes alerts and nagios doesn't. i want to know about those too [16:53:47] in particular i want IRC to know about them [16:54:22] (this current enwiki case is one example and another is gerrit ~5-10 days ago) [16:54:29] i guess easiest is to add one new user to watchmouse, forward to a list, subscribe to list, parse mail by bot [16:54:42] right [16:54:53] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [16:55:10] i was just wondering if it also sends mail for recoveries or just it's broken [16:55:12] Unless it's alerting about the server mailman is on. [16:55:18] hah [16:55:58] well, nagios will notice that though, pretty sure :) [16:56:01] You could parse the status page every x seconds but pubsub would be far better or even http streaming. [16:56:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:56:25] nah, i prefer pubsub i think [16:56:59] * jeremyb wonders who knows if we can ignore all of these BGP alerts. no leslie [16:57:04] mark? [16:57:16] he is working on them [16:57:25] ahh [16:57:27] Mark has been doing network stuff all day for solve some issue [16:57:43] notpeter: you there? 
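A hypothetical sketch of the watchmouse-to-IRC relay being brainstormed above (extra contact address -> mailing list -> bot parses the mail): the MTA pipes each notification into a small script like the one below. irc-announce and the channel argument are stand-ins for whatever would actually post to IRC; they do not exist today.

    #!/bin/sh
    # one watchmouse notification arrives on stdin (e.g. via .forward or a procmail rule)
    subject=$(grep -i -m1 '^Subject:' | sed 's/^[^:]*:[[:space:]]*//')
    printf 'watchmouse: %s\n' "$subject" | irc-announce '#channel'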
[16:58:28] notpeter: can you please merge and push https://gerrit.wikimedia.org/r/13893 [16:59:10] jeremyb: watchmouse has been renamed to nimsoft cloud user experience monitor [16:59:14] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [16:59:26] jeremyb: "Nimsoft Cloud User Experience Monitor supports all types of APIs, including REST, SOAP, oAuth, JSON, XML, RSS feeds, openID and XML-RPC." [16:59:28] mutante: what a mouthful [17:00:35] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:03:17] mutante: funny URL! http://www.nimsoft.com/solutions/nimsoft-cloud-user-experience.html/.html [17:03:38] heh, yes [17:05:14] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [17:06:22] preilly: i think he's afk. any time? [17:06:34] mutante: I need it now [17:07:22] New review: Dzahn; "inetnum: 41.203.128.0 - 41.203.159.255" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13893 [17:07:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13893 [17:08:07] preilly: done. looks like Orange Niger indeed [17:09:00] PROBLEM - Host mw1076 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:16] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:20:52] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:23:17] meta is down atm [17:23:25] just came back up [17:27:55] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [17:29:43] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [17:30:05] Jasper_Deng is our watchmouse for now i guess [17:30:16] am I? [17:30:25] Well you can't count on me 100%, ofc [17:30:51] * jeremyb wants an SLA [17:32:16] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:34:10] SLA? We get to pump Jasper_Deng full of caffine and hold his eyes open with duct tape? COOL [17:34:26] * Jasper_Deng does not use caffeine [17:38:52] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:40:32] is it true that edits to templates can have serious performance impacts? say a template used in 10 million pages. [17:40:49] down [17:40:53] (enwiki) [17:41:07] back up again [17:41:43] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [17:44:52] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [17:45:55] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [17:46:13] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 37%, RTA = 0.20 ms [17:47:20] ToAruShiroiNeko: sure? is there some followup question? don't wait to ask... [17:49:28] ToAruShiroiNeko: edits to templates are more expensive than edits to non-templates. edits to popular templates are more expensive than edits to lonely templates. expense can be stretched out over time but it's still expense. idk what's currently working or not. (see e.g. 
Dispenser's bot and Tim's push on the job queue) [17:50:48] ToAruShiroiNeko: https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#3_million_null_edits [17:56:04] Jeff_Green: so, let me know when you figure out the procedure? or are we just waiting for hashar? [17:56:47] i think it's up to hashar at this point [17:57:18] in fact, I'm not even sure how I would pull from git to production for configs [17:57:46] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [17:58:42] Jeff_Green: i hear wikitech calling... ;/ [17:59:27] afaik there's a giant glaring divot in wikitech on this exact subject [17:59:46] no i mean asking for someone to make a page ;) [18:00:09] yup. [18:07:31] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:12:01] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:14:07] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:15:33] it's a hashar ;) [18:16:22] hi [18:16:22] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:16:26] * Damianz tickles hashar with a feather [18:17:10] New patchset: Pyoungmeister; "copying roles from site.pp to role/apache.pp plus cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13898 [18:17:43] New patchset: Pyoungmeister; "removing myself from nagios until I return from vacation on 2012-07-09" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13899 [18:17:45] hashar: so, it seems most people don't understand how to deploy apache conf changes in this git+svn world. or, well, just about anything else about the setup [18:18:03] I did sent an email to the internal operations list [18:18:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13898 [18:18:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13899 [18:18:34] jeremyb: willing to get names :-] [18:18:36] hashar: i think people still don't get it. but i haven't seen the email of course [18:19:09] hashar: Jeff_Green was trying to test bug 31369. other people in this channel had no answers for him [18:19:48] jeremyb: I have copy pasted the email to your mailbox :) [18:19:55] * jeremyb reads [18:20:02] hashar: the gist is that we have changes that are in svn and local git but not pushed back to gerrrit, and we're not sure how to do that sanely [18:20:17] 02 16:33:43 < jeremyb> mutante: Jeff_Green: i thought there was a repo with some stuff gitignored and other stuff svnignored so stuff doesn't go in both repos. maybe this is that repo? [18:20:27] Jeff_Green: there is no way!! it is insane right now :-))) [18:20:38] Jeff_Green: have you read my email to the operations list? [18:20:44] hashar: that should be :-(((( [18:21:03] whoa, RCS?! [18:21:11] what century is this? 
[18:21:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13551 [18:21:58] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13899 [18:21:59] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13898 [18:22:28] ok, done reading [18:22:38] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:22:44] nothing really new there i think? [18:23:01] jeremyb: we used RCS a long time ago. I am not sure cvs let you create a local filerepo and rcs is essentially the same anyway (no svn did not exist yet :p ) [18:23:03] hashar: so, should you be able to just `git pull` on fenari? [18:23:19] hashar: i've used cvs locally [18:23:21] Jeff_Green: so yeah the files are tracked in both a local svn repo and a remote/public repo [18:23:27] Jeff_Green: so you basically have to commit twie [18:23:47] Jeff_Green: if there is really nothing sensible in the svn repo, we could migrate everything to the public git one [18:23:58] sensitive* [18:23:58] so do I essentially do push_for_review_production ? [18:24:01] Jeff_Green: and/or we could set up a second private git repository [18:24:08] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:24:12] Jeff_Green: master not production [18:24:22] so basically, do your change on fenari and svn commit them as usual [18:24:28] <^demon> How about one repo that makes sense rather than 2 git repos and an svn repo that doesn't make sense? [18:24:29] then deploy the change with apache-sync or something [18:24:35] yup that part is done [18:24:40] then I did "git commit -a" [18:24:42] once done you can git add the changes you made and send them for review [18:24:43] and stopped after that [18:24:45] then merge in gerrit [18:24:52] let me check [18:25:24] so currently we have some setting for enwiki => true [18:25:32] in InitialiseSettings.php [18:25:42] grr [18:25:44] wrong dir [18:25:46] * hashar hides [18:25:52] ha [18:26:22] * jeremyb gets out the xray gun [18:27:17] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [18:27:17] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [18:28:37] grrr [18:28:38] I hate svn [18:30:08] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:30:26] !log set up ignore file in httpd configuration directory [18:30:35] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:30:36] Logged the message, Master [18:31:02] Jeff_Green: so indeed we have several change in the git master branch [18:31:15] Jeff_Green: svn status is empty so everything is probably committed to svn already [18:31:22] yes [18:31:34] git log --oneline --decorate --graph [18:31:38] many of those changes are mine [18:31:40] that will show you the git commits in there [18:31:56] (origin/master) is the remote branch. 
Currently pointing to d2bdf3d [18:32:18] 4 commits later (de2c12c) are the pointers master (local branch) and HEAD (the current working copy) [18:32:29] the first thing is to update the remote (aka origin) [18:32:37] New patchset: Jalexander; "US only for shop link" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13900 [18:32:41] cause some change might have been merged in origin [18:33:08] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:34:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13900 [18:34:13] hashar: so can we do a little merge party to make origin/remote/gerrit resemble what's current in svn? [18:35:19] yes sure [18:35:27] the first thing is to update the remote [18:35:38] git fetch origin [18:35:38] or git remote update [18:35:46] that will fetch the latest pointer value for origin/master [18:35:52] I let you do it :) [18:37:05] fetched [18:37:38] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [18:37:51] so there's plenty o diff of course [18:38:05] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:38:38] Jeff_Green: http://dpaste.org/0Y6W9/ [18:38:39] indeed [18:38:52] --graph is lovely there [18:39:01] it instantly shows you that the branches have diverged [18:39:11] so we have to rebase the local master on top of origin/master [18:39:30] that is where my git skills are toping of, I need to read the man page for git-rebase to find out how to use git rebase --onto [18:39:49] ok [18:40:09] of course that the worse man page ever :/ [18:40:20] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [18:40:27] lets try [18:40:40] $ git rebase master origin/master [18:40:41] First, rewinding head to replay your work on top of it... [18:40:42] Applying: redirects.conf: mk whitespace consistent [18:40:43] Applying: comment out redirect cfp.wikimania.wm [18:41:05] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:41:10] bah epic failure [18:41:40] those are my commits you're quoting ;) [18:41:58] hashar: do you need some help with the surgery? [18:42:08] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [18:42:26] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:42:36] sure [18:42:51] I screwed the local repo :) [18:43:32] so hmm [18:43:35] I did the hard way: [18:43:51] git reset origin/master && git cherry-pick (the commits from the master branch) [18:43:53] that is ugly [18:44:27] Jeff_Green: I have did some ugly work [18:44:46] jeremyb: given the log graph at http://dpaste.org/0Y6W9/ how would one put the 4 changes from master onto of origin/master ? [18:45:11] I know we can do it but every time I want to do that I end up spending half an hour testing it in a demo repo [18:45:20] and eventually give up by just doing cherry-picking [18:46:08] ha [18:46:11] Jeff_Green: I cleaned up the repo. I have put master (local) to follow remotes/origin/master [18:46:21] git log --oneline --graph --decorate gives a nicer output now [18:46:32] hashar: i don't see which are in origin and which aren't. 
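For the graph in the dpaste above (local master a few commits ahead of a diverged origin/master), the usual form is "git rebase <upstream> [<branch>]": the first argument is what you rebase onto, so "git rebase master origin/master" works in the opposite direction from what was wanted and leaves a detached HEAD. A minimal sketch, assuming a clean working copy:

    git fetch origin                         # refresh remotes/origin/master
    git log --oneline --graph --decorate     # confirm where the branches diverged
    git rebase origin/master master          # replay master's local-only commits onto origin/master
    # equivalent, with master already checked out: git rebase origin/master
    git push origin master:refs/for/master   # then send the rebased commits to gerrit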
i guess i could ask gerrit [18:46:40] hashar: going forward how should we make changes? [18:46:50] jeremyb: in http://dpaste.org/0Y6W9/ the first 4 are not in gerrit I think [18:47:02] Jeff_Green: now that the repo is clean locally, we can push the 4 commits to gerrit :) [18:47:22] Jeff_Green: as to how to make change, please ask in ops list [18:47:30] hashar: nobody knows [18:47:31] Jeff_Green: I know RobH was wondering the same [18:47:38] everybody is wondering the same [18:47:46] so you have to discuss about it I guess [18:47:51] couldn't you just rebase on top of the current origin/master ? [18:47:55] well my suggestion is to rip out git for now [18:47:56] what was i wondering? [18:48:10] and leave just svn, and not be in a half-way state where people don't know where to initially submit changes [18:48:10] RobH: talking about httpd conf and git [18:48:16] Jeff_Green: the idea is to drop svn :) [18:48:50] New patchset: Hashar; "modified redirects.conf for educacao.wikimedia.org per RT #3138" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13904 [18:48:52] New patchset: Hashar; "redirecting wikimediafoundation.org/wiki/Donate -> donate.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13906 [18:48:53] New patchset: Hashar; "re-syncing git to svn, :-(" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13907 [18:49:02] here are the changes ( git push origin master:refs/for/master ) [18:49:10] who was running mysqldump on db12? [18:49:28] binasher: hello asher we talked about this morning. snapshot4 was doing a mysqldump for the dumps [18:49:47] hashar: can we do that yet? [18:49:56] binasher: I though about migrating the special queries to another idling db from s1, but did not want to cause any havoc [18:50:05] hashar: thanks.. the xml data dumps? [18:50:09] Jeff_Green: i think there's still a question about sanitizing? [18:50:30] Jeff_Green: because some files were already public and some weren't. why weren't they? should they remain secret? [18:50:30] Jeff_Green: can you merge the changes in gerrit one after the other ? 13904 / 13905 / 13906 / 13907 ? [18:50:41] hashar: yep [18:50:46] binasher: I think so. We did some log in serveradmin logs [18:51:08] hashar: why didn't the bot mention 13905? [18:51:17] binasher: [18:51:28] the dump process runs a series of mysqldumps [18:51:33] we do this every month (enwiki) [18:51:42] like clockwork, no table locking [18:51:53] binasher: the problem we had is that db12 received queries from 3 sources: 1) populateSha1something (Tim kill -STOP them sunday), run by Aaron on hume 2) the mysqldump 3) the usual whatclist / specials queries [18:52:35] binasher: mysqldump is not going to do any harm since it is set to avoid locking the tables [18:52:37] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13904 [18:52:54] oh hi apergos :) I tried to wrote a summary of today for asher. Feel free to correct [18:53:05] jeremyb: some packet loss maybe? [18:53:11] however until we know definitively what went wrong I don't want to start en wiki dumps again and leave them unattended [18:53:12] hashar: locking tables isn't the only issue mysqldump causes [18:53:18] jeremyb: I don't know really [18:53:37] hashar: errrrm? no UDP here i think [18:53:37] so we ended up wondering what was happening [18:54:02] we did investigate the whatchlist queries, some of them doing 'writing to tmp (table or file? 
can't remember)' [18:54:06] hashar: running it in a single transaction without locking tables blocks purging for the duration which kills writes once the iblog fills [18:54:29] jeremyb: yeah that is a file, would need to look at the log file. Not sure where it is though and I probably don't have access on gerrit server. [18:55:01] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13905 [18:55:17] hashar: sure. but i'm just saying it can't be packetloss [18:55:40] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13906 [18:55:46] jeremyb: maybe the bot forgot about it ? [18:56:02] Jeff_Green: so that is a bit tedious :-] [18:56:04] hashar: i think not [18:56:20] hashar: scriptable! [18:56:23] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 28 seconds [18:56:29] no access on server so no idea what went wrong [18:56:42] Change merged: Jgreen; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13907 [18:56:47] hashar: i mean the tedium is scriptable [18:56:50] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 25 seconds [18:57:08] hashar: yes, I think it's not a reasonable state to have two revision control systems in parallel [18:57:24] binasher: you will want to discuss about db12 / dump with ariel I guess. [18:57:27] especially if one can't take the lead [18:57:39] binasher: I am not even sure I know how innodb works :-D [18:58:01] Jeff_Green: well, what's in those other files? noc.conf, wikimedia-ssl.conf, others ? [18:58:08] Jeff_Green: that is kind of transition I guess. the svn repo probably has nothing private in it so we could just send the files ini git [18:58:48] Jeff_Green: they are probably find to go though [18:59:05] Jeff_Green: we just added to git the files that were already public, not knowing what to do with the other files [18:59:14] yep [18:59:43] where it gets confusing is when people expect to be able to work git-->gerrit-->prod when everyone else is working svn-->prod-->git [18:59:53] if there is really nothing private there, we could just add the files in git and forget about the svn repo [19:01:18] there's question about the propagation tools too, I don't know much about that [19:01:38] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:03:01] Jeff_Green: I have replied to Ryan reply to my ops mail and added you to bcc: [19:03:27] cool [19:03:33] Jeff_Green: as for propagating, I guess the idea would be to make the change on your comp, submit to gerrit, submit, git pull, deploy [19:03:37] aka like puppet more or less [19:03:51] yup, that would be fine imo [19:03:56] though you can do all of that on fenari directly since the remote repo is ssh://gerrit.wikimedia.org…. (aka no username) [19:04:37] once we're switched to git people shouldn't work directly on fenari, too dangerous [19:04:44] Jeff_Green: I am just not going to be the one making those files publicly available. 
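For reference on binasher's point a little further up: the non-locking dump the scripts run is presumably something along these lines (the exact options used on snapshot4 are an assumption), and its cost is purge rather than locks:

    mysqldump --single-transaction --quick --skip-lock-tables \
        -h db12 enwiki > enwiki.sql
    # --single-transaction holds one long consistent-read snapshot instead of
    # locking tables; while that snapshot is open InnoDB cannot purge old row
    # versions, so undo/history piles up for the whole duration of the dump --
    # the "blocks purging ... kills writes" effect described above.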
I have no idea how it would affect us to reveal them (I am probably just paranoid) [19:04:56] think of how often people throw up their arms with "my git checkout is totally broken" [19:05:20] hashar: yep [19:05:26] we could add a little consistency script [19:05:43] to make sure no unwanted change is in the working copy [19:05:56] and that (master) is 0 or 1 change ahead of (origin/master) [19:06:01] and git diff is empty :) [19:06:04] that might help [19:06:36] we could even make fenari use http instead [19:06:49] what I would love is a gitfs though. So we could have the apaches includes /mnt/git//main.conf [19:06:52] although that barfs frequently b/c gerrit is breaky [19:07:38] Jeff_Green: mediawiki has ssh:// so we can easily write live hacks directly on fenari and sync them without using gerrit [19:07:52] ah [19:07:54] that is used for emergencies [19:08:02] when you can't wait to push to gerrit + click a button + git pull [19:08:19] you just want to: git checkout HEAD^ && sync-file faulty.php [19:08:40] (aka checkout the version prior '^' to the working copy version 'HEAD' [19:09:28] Jeff_Green: daughter crying need to go [19:09:31] sorry :/ [19:09:38] follow up on operations list :) [19:13:10] This site is experiencing technical difficulties.
Try waiting a few minutes and reloading. (Cannot contact the database server: Unknown error (10.0.6.48)) You can try searching via Google in the meantime.
[19:13:10] Note that their indexes of our content may be out of date.
[19:14:02] s1 master [19:14:23] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:17:23] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:22:29] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [19:31:31] AaronSchulz: can you throttle the revision sha1 updates a bit? [19:32:27] binasher: I guess [19:32:40] binasher: lower rate or smaller chunks? [19:33:05] how is it currently chunked? [19:33:35] AaronSchulz: you need ALL of the wfWaitForSlaves() [19:35:15] re: chunking, the actual updates are one at a time, so lower rate would be needed or slave lag awareness as reedy says [19:35:31] There are the wfwaitforslave calls in a few places [19:35:56] * apergos lurks (sorry for being less intereactive, I erealized I had got to get dinner) [19:35:57] yeah [19:36:04] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=maintenance/populateRevisionSha1.php;h=1d8e4c8ba6393740c41f827d764efe969ca0fd15;hb=HEAD [19:36:17] batch is 200 by default [19:38:37] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:39:43] Reedy: it was running higher that 200 [19:39:52] I just halved it [19:39:54] ah [19:41:10] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [19:41:55] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [19:43:15] lag going up [19:43:43] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:44:28] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 191 seconds [19:44:36] boing [19:45:01] hmm [19:45:31] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 199 seconds [19:46:07] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:46:51] * AaronSchulz drops the batch size some more [19:49:01] and replag is going down [19:49:27] was [19:49:52] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 5 seconds [19:50:19] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 4 seconds [19:51:24] binasher: it's 300 rev chunks atm [19:52:07] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [19:52:22] * apergos grans [19:52:27] *groans [19:52:28] this could be a long night [19:52:52] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:53:37] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [19:54:51] yay replag 0 finally [19:55:10] I guess I will not restart dumps tonight and go to bed, I'll do it tomorrow morning [19:55:45] Change abandoned: Hashar; "Since I killed stylesheets on enwiki last week, I am going to override those settings in CommonSetti..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13892 [20:01:16] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:07:34] PROBLEM - SSH on pdf3 is CRITICAL: Server answer: [20:10:25] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [20:13:16] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:16:16] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [20:19:16] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:23:38] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 51, down: 0, dormant: 0, excluded: 0, unused: 0 [20:23:38] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [20:25:58] New patchset: Hashar; "(bug 37245) makes labs use bits.beta.wmflabs.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13932 [20:28:07] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 3, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [20:28:25] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:31:16] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [20:36:40] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Mon Jul 2 20:36:16 UTC 2012 [20:38:28] New patchset: Hashar; "labs: remove show (enabled in production)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13933 [20:38:59] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13933 [20:39:31] New review: Hashar; "safeguarded properly." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/13932 [20:39:34] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13932 [20:41:03] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:42:24] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 91, down: 0, dormant: 0, excluded: 0, unused: 0 [20:54:33] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [20:58:27] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [21:04:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [21:10:00] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:14:03] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [21:14:30] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [21:17:03] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:17:39] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [21:23:30] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:25:27] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:28:00] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 5, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [21:28:54] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [21:29:30] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:29:57] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [21:30:33] PROBLEM - Puppet freshness on tarin is CRITICAL: Puppet has not run in the last 10 hours [21:33:33] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [21:34:09] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [21:34:43] New patchset: Bhartshorne; "moving around container rings to more heavily weight the SSDs (and thereby improve performance)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13938 [21:35:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13938 [21:36:06] PROBLEM - Host cr1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:37:54] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:38:03] RECOVERY - Host cr1-sdtpa is UP: PING WARNING - Packet loss = 44%, RTA = 59.03 ms [21:39:05] maplebed: are there several ssds up now? [21:39:13] AaronSchulz: no. [21:39:33] I'm just adjusting the weights so that it's more probable that every container has one copy on ms-b5. [21:39:49] AaronSchulz: turns out we won't get more SSDs till thursday or friday. 
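For context on the ring change merged above (change 13938): weighting the SSD-backed devices up so that more container partitions land on them is normally a swift-ring-builder operation along these lines. Device names and weights below are made up, and the actual change went in through puppet:

    swift-ring-builder container.builder set_weight d10 300   # SSD-backed device: weight up
    swift-ring-builder container.builder set_weight d2 100    # spinning disk: weight down
    swift-ring-builder container.builder rebalance
    # then distribute the regenerated container.ring.gz to the proxy/storage nodes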
[21:39:51] RECOVERY - Puppet freshness on ms-be4 is OK: puppet ran at Mon Jul 2 21:39:38 UTC 2012 [21:40:18] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:40:27] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:41:19] maplebed: what kind of ssd's are you using? [21:41:21] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 28%, RTA = 0.55 ms [21:41:22] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:34] maplebed: we are going to pull a bunch of them from the mc hosts [21:41:57] uh oh, hey guys [21:41:57] binasher: I'm not actually sure. RobH should be able to tell you. They're intel 160G, if that helps. [21:42:02] can someone help out with emery? [21:42:05] it looks borked for a few days [21:42:09] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=emery.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [21:42:12] can't ssh in [21:42:21] E3 team just noticed and told me [21:43:05] LeslieCarr, you got a sec to poke it with a stick? or maybe someone can tell me if I can poke it (is there a web console interface or sumpin?) [21:43:27] maplebed: that sounds like the x25-2's that are in the cp and mc hosts.. i suppose they've already been ordered but there are 16 that could be pulled from mc hosts and used immediately if you need them [21:43:36] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:43:54] binasher: we didn't order any more. I've been told there were sufficient at the colo already to populate the new hosts. [21:43:54] PROBLEM - LVS HTTPS IPv4 on upload.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:00] the holdup isn't the disks but the person to install the disks. [21:44:01] :( [21:44:12] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60499 bytes in 1.027 seconds [21:44:15] ah, ok [21:44:30] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:44:34] (stoopid vacations. why do we let people take vacation?) [21:44:50] ottomata: let me try and connect to the management interface. [21:44:57] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43453 bytes in 3.010 seconds [21:45:15] RECOVERY - LVS HTTPS IPv4 on upload.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 587 bytes in 1.885 seconds [21:45:15] thanks maplebed [21:45:42] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:42] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:42] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:42] PROBLEM - check_minfraud_primary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:57] huh. All I see on console is: [21:45:57] [1948810.593379] [21:46:00] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 61%, RTA = 0.23 ms [21:46:04] mean anything to anyone? 
[21:46:22] ha, not to me [21:47:57] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [21:48:33] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [21:48:43] !log rebooted emery - it's been unresponsive for 3 days. [21:48:51] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:48:52] Logged the message, Master [21:49:00] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:49:09] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [21:49:11] maplebed: http://paste.debian.net/177473/ [21:49:55] wfm. [21:50:03] are you in the EU? [21:50:06] yea [21:50:11] lots of packet loss to the squid's public ips aka 208.80.152.66 [21:50:27] RECOVERY - Host emery is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [21:50:36] for reference I noticed it getting slaggier and slaggier gradually [21:50:36] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:36] PROBLEM - check_minfraud_primary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:45] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:46] mark, LeslieCarr ^^^^ [21:50:54] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:51:02] is that a side effect of what you're working on? [21:51:03] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:51:03] PROBLEM - NTP on emery is CRITICAL: NTP CRITICAL: Offset unknown [21:51:03] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/arabic-banner.log, /var/log/squid/teahouse.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [21:51:21] RECOVERY - check_minfraud_primary on payments3 is OK: OK [21:51:21] RECOVERY - check_minfraud_secondary on payments3 is OK: OK [21:51:39] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:52:10] AzaToth: so I guess that's an expected side effect of network troubles that are currently being diagnosed. [21:52:15] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:21] ok [21:52:24] PROBLEM - Host capella is DOWN: PING CRITICAL - Packet loss = 100% [21:52:24] (nice picture though.) 
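The paste AzaToth links above is a per-hop loss report; something like the following is the usual way to reproduce that kind of measurement from an affected vantage point (target IP taken from the discussion, tool choice an assumption):

    mtr --report --report-cycles 100 208.80.152.66   # per-hop loss and latency
    ping -c 100 208.80.152.66                        # overall loss percentage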
[21:53:00] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING WARNING - Packet loss = 28%, RTA = 0.22 ms [21:53:09] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:53:09] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:53:17] unexpected side effect [21:53:33] maplebed: I was trying to evaluate it for Quality Images on commons, but only got 30% of it [21:53:36] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39248 bytes in 0.927 seconds [21:53:45] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:53:58] mark: I see [21:54:03] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:54:12] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:54:21] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:54:21] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60495 bytes in 0.832 seconds [21:54:39] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49572 bytes in 0.735 seconds [21:54:48] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39240 bytes in 0.687 seconds [21:55:15] RECOVERY - check_minfraud_secondary on payments2 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.415 second response time [21:55:15] RECOVERY - check_minfraud_secondary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.421 second response time [21:55:15] RECOVERY - check_minfraud_primary on payments2 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.141 second response time [21:55:15] RECOVERY - check_minfraud_primary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 128 bytes in 0.148 second response time [21:55:15] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:55:42] RECOVERY - NTP on emery is OK: NTP OK: Offset -0.04470491409 secs [21:57:30] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [21:58:51] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [21:59:34] en wiki seems somewhat broken - Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 21:59:19 GMT [22:01:33] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [22:01:51] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [22:02:00] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [22:02:18] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [22:02:18] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [22:02:45] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [22:02:45] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [22:02:45] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: 
CRITICAL: host 208.80.152.196, sessions up: 3, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS64600 not established - BRPeering with AS64600 not established - BR [22:03:30] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49571 bytes in 1.985 seconds [22:03:39] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60502 bytes in 0.946 seconds [22:03:39] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39238 bytes in 0.790 seconds [22:04:06] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39250 bytes in 0.940 seconds [22:04:06] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49578 bytes in 1.049 seconds [22:04:15] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 6, down: 0, shutdown: 0 [22:04:42] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60495 bytes in 0.814 seconds [22:09:12] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:13:42] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [22:30:30] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [22:30:30] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [22:32:27] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [22:33:30] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [22:33:41] !log rebooted db36 for kernel upgrade [22:33:50] Logged the message, Master [22:34:24] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [22:34:24] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [22:35:27] PROBLEM - Host db36 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:24] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:37:24] RECOVERY - Host db36 is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [22:38:27] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [22:38:27] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [22:39:21] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [22:40:24] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [22:40:24] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [22:40:24] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [22:41:09] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 803 seconds [22:41:36] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 814 seconds [22:42:30] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [22:44:27] PROBLEM - 
Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [22:45:21] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [22:46:06] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:24] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [22:46:24] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [22:47:27] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [22:48:30] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [22:49:24] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [22:49:24] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [22:51:21] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [22:53:27] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [22:53:27] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [22:53:27] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [22:54:21] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [22:56:27] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [22:59:27] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [23:02:09] PROBLEM - NTP on db38 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:09] PROBLEM - NTP on db43 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:09] PROBLEM - NTP on db39 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:18] PROBLEM - NTP on db44 is CRITICAL: NTP CRITICAL: No response from NTP server [23:02:27] PROBLEM - NTP on db37 is CRITICAL: NTP CRITICAL: Offset unknown [23:02:27] PROBLEM - NTP on db51 is CRITICAL: NTP CRITICAL: Offset unknown [23:02:36] PROBLEM - NTP on db45 is CRITICAL: NTP CRITICAL: No response from NTP server [23:03:39] PROBLEM - NTP on db1039 is CRITICAL: NTP CRITICAL: Offset unknown [23:03:57] RECOVERY - NTP on db37 is OK: NTP OK: Offset 1.323223114e-05 secs [23:03:57] RECOVERY - NTP on db51 is OK: NTP OK: Offset -1.47819519e-05 secs [23:03:57] PROBLEM - NTP on db1007 is CRITICAL: NTP CRITICAL: Offset unknown [23:06:47] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:07:40] New patchset: preilly; "add corrected range(s) for saudi telecom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14004 [23:08:10] Ryan_Lane: can you approve this for me ^^ [23:08:12] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14004 [23:08:22] lemme see [23:08:37] Ryan_Lane: okay thanks [23:08:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14004 [23:08:53] RECOVERY - NTP on db1007 is OK: NTP OK: Offset 0.00180721283 secs [23:08:58] Ryan_Lane: thanks again [23:09:07] yw [23:16:59] RECOVERY - NTP on db1039 is OK: NTP OK: Offset 0.0008271932602 secs [23:19:50] RECOVERY - NTP on db39 is OK: NTP OK: Offset 0.0002022981644 secs [23:20:26] RECOVERY - NTP on db44 is OK: NTP OK: Offset 0.001883864403 secs [23:20:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [23:21:02] RECOVERY - NTP on db45 is OK: NTP OK: Offset 0.001336693764 secs [23:22:32] RECOVERY - NTP on db43 is OK: NTP OK: Offset 0.002466797829 secs [23:22:32] RECOVERY - NTP on db38 is OK: NTP OK: Offset 0.002821087837 secs [23:33:29] wtf is going on??? [23:33:48] I'm in Nicaragua, just sat on my laptop after a day and a half or so [23:33:53] I've gotten like 100 pages [23:34:45] which part would you like to know? [23:37:22] it's still paging [23:37:29] right now [23:37:35] so, I guess everything is /not/ ok? [23:38:34] AzaToth: so I guess that's an expected side effect of network troubles that are currently being diagnosed. [23:38:44] in response to AzaToth asking about the flapping [23:39:06] unexpected side effect [23:40:05] but it looks like it was back to normal about 1.5 hours ago [23:42:56] I'm getting pages right now [23:43:08] so, it probably wasn't solved [23:43:17] or it's a different issue [23:43:53] old messages i suspect paravoid [23:44:03] oh, could be [23:44:06] just catching up with you [23:44:06] let me check [23:44:46] being in a foreign timezone doesn't help with interpreting UTC times in SMSes [23:45:11] hm, indeed [23:45:18] old queued up messages it is [23:45:49] thanks CT [23:46:56] thank leslie andmark [23:47:57] writing the followup now [23:48:27] sorry about being not very responsive via irc, i was doing a screen sharing meeting with vendors + poking frantically at the routers [23:52:01] LeslieCarr: fwiw, I didn't hack the commit log :) [23:52:09] hahaha [23:52:12] ;) [23:52:12] and haven't committed anything [23:52:19] and it was definitely discard last week [23:52:28] really ? [23:52:32] how'd it become reject [23:52:33] definitely = I could see packets being discarded [23:52:37] this is puzzling [23:52:38] not getting rejects [23:52:47] asked mark, he said that he confirmed it being discard on the junipers [23:52:48] hrm... [23:52:51] yeah [23:52:52] I didn't check myself [23:53:14] so strange [23:53:22] it's possible there were lots of fallible memories [23:53:44] I'm 100% sure that when I tried pinging virt6/7/8 from e.g. sockpuppet [23:53:54] I wouldn't get a reply [23:54:03] (either echo reply or admin prohibited) [23:54:41] * Damianz gives LeslieCarr a cookie for fixing a little part of the internet [23:54:52] yay cookies [23:57:53] * maplebed did that too! [23:57:55] LeslieCarr: just read your mail. wow [23:58:10] hehe, maplebed did give me a cookie :) [23:58:19] it gave me the power to keep talking with vendor support [23:58:34] speaking of that, i never really ate lunch, going to find a quick snack
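On the discard-versus-reject confusion at the end here: the two Juniper filter actions are distinguishable from the outside, which is the quickest way to check which one is really configured. A sketch from a host like sockpuppet (hostnames taken from the discussion):

    ping -c 3 virt6
    # filter action "discard": probes are dropped silently -> 100% packet loss
    # filter action "reject":  each probe gets an ICMP destination-unreachable
    #                          (communication administratively prohibited) back --
    #                          the "admin prohibited" paravoid mentions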