[00:08:21] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Sun Jan 27 00:08:07 UTC 2013 [00:08:46] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:16:07] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [00:18:04] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [00:18:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 195 seconds [00:19:12] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 239 seconds [00:19:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 252 seconds [00:20:09] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:20:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [00:20:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [00:21:08] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [00:22:20] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [00:22:21] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [00:23:23] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [00:24:26] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [00:35:26] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [00:38:23] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [00:38:23] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [00:41:23] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [00:55:10] !log core dumping cr1-sdtpa - 1% chance of minor network issue [00:55:24] Logged the message, Mistress of the network gear. 
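Context for the stream of alerts above: the "Puppet freshness" check goes CRITICAL when a host has not completed a Puppet run in the last 10 hours and recovers as soon as a run finishes. A minimal sketch of that kind of check, not the actual Icinga plugin; the state-file path and the use of its mtime are assumptions:

```bash
#!/bin/bash
# Sketch of a "Puppet freshness" check: CRITICAL if the last Puppet run
# finished more than 10 hours ago. Assumes the agent touches this state
# file at the end of every run (path is an assumption).
STATE_FILE=/var/lib/puppet/state/last_run_summary.yaml
MAX_AGE=$((10 * 3600))   # 10 hours, matching the alert text

now=$(date +%s)
last_run=$(stat -c %Y "$STATE_FILE" 2>/dev/null || echo 0)
age=$((now - last_run))

if [ "$age" -gt "$MAX_AGE" ]; then
    echo "CRITICAL: Puppet has not run in the last 10 hours"
    exit 2
else
    echo "OK: puppet ran at $(date -u -d "@$last_run")"
    exit 0
fi
```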
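Similarly, the MySQL Slave Delay / Replication Heartbeat alerts in the same stream track replication lag and, judging by the messages, trip somewhere above ~180 seconds and recover near 0. A rough sketch of a lag probe built on SHOW SLAVE STATUS; the 180-second threshold, host argument, and reliance on credentials from ~/.my.cnf are assumptions, not the production check:

```bash
#!/bin/bash
# Rough sketch of a replication delay probe. The CRIT threshold of 180s is
# inferred from the alerts above, which fire at roughly 180+ seconds.
HOST=${1:-db53}
CRIT=180

delay=$(mysql -h "$HOST" -N -e "SHOW SLAVE STATUS\G" \
        | awk '/Seconds_Behind_Master/ {print $2}')

if [ -z "$delay" ] || [ "$delay" = "NULL" ]; then
    echo "CRITICAL: replication is not running on $HOST"
    exit 2
elif [ "$delay" -ge "$CRIT" ]; then
    echo "CRIT replication delay $delay seconds"
    exit 2
else
    echo "OK replication delay $delay seconds"
    exit 0
fi
```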
[00:59:59] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::0 [01:00:01] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::12 [01:00:01] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::4 [01:00:08] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::6 [01:00:09] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [01:00:09] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3 [01:00:09] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:17] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::5 [01:00:18] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [01:00:26] PROBLEM - Host wikinews-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:26] PROBLEM - Host wikiquote-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:27] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2 [01:00:27] PROBLEM - Host wikiversity-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:28] PROBLEM - Host wiktionary-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:44] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::9 [01:00:45] PROBLEM - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::12 [01:00:45] mediawiki.org won't load [01:00:45] PROBLEM - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::13 [01:00:53] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::a [01:00:54] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::8 [01:00:55] shit [01:00:55] ok [01:01:02] PROBLEM - Host bits-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:04] en wiki is offline too [01:01:11] PROBLEM - Host foundation-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.233) [01:01:12] PROBLEM - Host mediawiki-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.232) [01:01:12] PROBLEM - Host mobile-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:15] loads for me [01:01:21] i deactivated pim - reactivating pim now [01:01:29] techman224: are you in Europe or not europe ? 
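The IPv6 host checks above are plain ping probes against the load-balanced service addresses, and each alert quotes the exact command line that was run. A small wrapper built around that same command; only the argument handling and the UP/DOWN echo strings are added here:

```bash
#!/bin/bash
# Ping an IPv6 LVS service address the same way the alerts above do:
# 5 packets, 15-second deadline. The default address is one quoted in the
# log (wikipedia-lb.esams).
ADDR=${1:-2620:0:862:ed1a::1}

if /bin/ping6 -n -U -w 15 -c 5 "$ADDR" > /dev/null 2>&1; then
    echo "UP: PING OK to $ADDR"
    exit 0
else
    echo "DOWN: no reply from $ADDR"
    exit 2
fi
```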
[01:01:38] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::b [01:01:39] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::b [01:01:39] PROBLEM - Host m.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:40] North america [01:01:47] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::12 [01:01:48] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::4 [01:01:49] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::0 [01:01:50] uhoh, can you please give me a traceroute [01:01:54] PROBLEM - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::13 [01:01:56] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [01:01:57] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::6 [01:01:57] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3 [01:01:58] PROBLEM - Host upload-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:01:59] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:59] PROBLEM - Host upload-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:02:04] :( [01:02:05] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [01:02:06] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::5 [01:02:06] PROBLEM - Host wikivoyage-lb.eqiad.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::13 [01:02:06] PROBLEM - Host wikisource-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.229) [01:02:07] PROBLEM - Host wikipedia-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.225) [01:02:07] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:08] PROBLEM - Host wikimedia-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:02:08] PROBLEM - Host wikimedia-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:09] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:15] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2 [01:02:31] * Jasper_Deng is having no problems [01:02:33] PROBLEM - Host wiktionary-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:34] PROBLEM - Host wikisource-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:35] PROBLEM - Host wikipedia-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:35] PROBLEM - Host wikiversity-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:36] PROBLEM - Host wikivoyage-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:36] PROBLEM - Host wikinews-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:37] PROBLEM - Host wikiquote-lb.eqiad.wikimedia.org_https is 
DOWN: PING CRITICAL - Packet loss = 100% [01:02:41] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::a [01:02:50] !log rpd core caused bgp to restart (even though it wasn't supposed to) - causing routing churn [01:02:51] The whole cluster is going down [01:02:56] Logged the message, Mistress of the network gear. [01:02:59] PROBLEM - Host bits-lb.eqiad.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (208.80.154.234) [01:03:06] techman224/ Jasper_Deng this makes sense in that we withdrew and readvertised some routes - causing churn [01:03:12] this would make it so that some are having issues [01:03:17] techman224: can you send me a traceroute please ? [01:03:24] LeslieCarr, still running [01:03:27] that explains some - I think my tunnelbroker caches routes [01:03:49] I'm getting a bunch from us.above.net [01:03:58] RobH: real alarm but my fault [01:04:14] RECOVERY - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [01:04:16] LeslieCarr: ahh, so i dont need to do anything? [01:04:19] awesome =] [01:04:20] It's back up [01:04:20] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [01:04:21] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.40 ms [01:04:21] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.04 ms [01:04:24] RECOVERY - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [01:04:29] RECOVERY - Host mobile-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 27.17 ms [01:04:30] RECOVERY - Host mediawiki-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [01:04:31] RECOVERY - Host foundation-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 27.19 ms [01:04:32] no need to do anything RobH [01:04:37] en.wikipedia.org load normally [01:04:38] RECOVERY - Host upload-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:04:41] RECOVERY - Host bits-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:04:41] RECOVERY - Host m.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [01:04:41] whew [01:04:41] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.33 ms [01:04:46] oh well, no one can say im not answering pages =] [01:04:48] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.40 ms [01:04:49] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.21 ms [01:04:56] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [01:04:57] RECOVERY - Host wikiquote-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [01:04:57] RECOVERY - Host wikiversity-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [01:04:57] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.05 ms [01:04:58] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.05 ms [01:05:05] RECOVERY - Host wikinews-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [01:05:06] thanks techman224 and Jasper_Deng for being so quick on this and RobH for jumping online so quickly [01:05:06] RECOVERY - Host 
wikibooks-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:05:06] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.14 ms [01:05:06] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.37 ms [01:05:07] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.33 ms [01:05:07] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.06 ms [01:05:08] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.10 ms [01:05:14] RECOVERY - Host wiktionary-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [01:05:23] RECOVERY - Host wikipedia-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [01:05:24] RECOVERY - Host wikisource-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [01:05:26] Participation reward, yay! [01:05:50] RECOVERY - Host wikimedia-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [01:05:57] RobH: being at the right place at the right time ;) [01:06:57] LeslieCarr: yw - though I didn't actually experience any issues [01:07:04] RECOVERY - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [01:07:18] Jasper_Deng, are you tunnelling through labs? [01:07:28] techman224: no - I'm using IPv6 tunnel [01:07:29] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.12 ms [01:07:37] which apparently cached the routes [01:07:38] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [01:07:38] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.29 ms [01:07:39] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.36 ms [01:07:47] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.04 ms [01:07:48] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.06 ms [01:07:48] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.12 ms [01:07:48] RECOVERY - Host upload-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [01:07:56] RECOVERY - Host wikivoyage-lb.eqiad.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms [01:07:57] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.12 ms [01:07:57] RECOVERY - Host wikibooks-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [01:07:58] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.02 ms [01:07:58] RECOVERY - Host wikimedia-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [01:07:58] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [01:08:05] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.07 ms [01:08:23] RECOVERY - Host wikisource-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [01:08:24] RECOVERY - Host wiktionary-lb.eqiad.wikimedia.org_https is UP: PING OK - 
Packet loss = 0%, RTA = 26.47 ms [01:08:24] RECOVERY - Host wikipedia-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [01:08:24] RECOVERY - Host wikiversity-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [01:08:25] RECOVERY - Host wikivoyage-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.76 ms [01:08:26] RECOVERY - Host wikinews-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 27.00 ms [01:08:32] RECOVERY - Host wikiquote-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:08:34] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.15 ms [01:08:50] RECOVERY - Host bits-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms [01:09:28] interesting factoid - that's also why icinga didn't page - nagios-wm is using cr1-sdtpa as it's gateway whereas icinga is using cr2-eqiad [01:09:33] the more you know [01:15:01] LeslieCarr: When you said 1% chance of minor, we know you all ment 99% change of taking the site down :P [01:16:30] hehe [01:30:56] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'deployed 93869b8d37b6b7999190b378cb6813f6317b0b10' [01:31:10] Logged the message, Master [01:33:07] LeslieCarr: she said 1% chance of minor problems, not the chance of major problems [01:33:20] that number was not specified [01:33:38] still where would the other 99% go? [01:33:43] gah, that was meant for Damianz [01:34:12] Jasper_Deng: to an occupy protest? [01:34:23] lol [01:34:33] * Damianz protests about Aaron|home keeping him awake at 2am [01:43:28] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours [01:58:24] New patchset: Techman224; "(bug 44395) Allow bureaucrats to remove the translateadmin group on wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46011 [02:26:58] !log LocalisationUpdate completed (1.21wmf8) at Sun Jan 27 02:26:57 UTC 2013 [02:27:10] Logged the message, Master [02:48:47] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'quick debugging' [02:48:58] Logged the message, Master [02:50:58] * Aaron|home wonders if apergos is around [02:51:06] !log LocalisationUpdate completed (1.21wmf7) at Sun Jan 27 02:51:05 UTC 2013 [02:51:17] Logged the message, Master [02:55:45] !log aaron synchronized php-1.21wmf8/maintenance/runJobs.php [02:55:56] Logged the message, Master [03:25:35] 4:50 am local time? 
I bet not ;) [03:31:11] * Aaron|home is still baffled [03:31:17] Reedy: while you are here https://gerrit.wikimedia.org/r/#/c/46012/ [03:35:14] Reedy: and https://gerrit.wikimedia.org/r/#/c/46009/ [03:35:45] still don't know what the main problem is that causes dbname delisting from cache value to take way too long [03:35:54] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 184 seconds [03:35:55] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 185 seconds [03:36:04] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 193 seconds [03:36:05] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 194 seconds [03:36:05] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 196 seconds [03:36:09] the runners get stuck faffing around the high priority jobs and never doing anything else [03:36:24] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 215 seconds [03:36:25] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 215 seconds [03:36:34] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 223 seconds [03:36:35] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 229 seconds [03:36:35] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 230 seconds [03:36:44] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 239 seconds [03:36:58] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 249 seconds [03:37:04] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [03:37:05] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:37:07] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 257 seconds [03:37:25] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [03:37:25] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [03:37:25] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 248 seconds [03:37:52] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 302 seconds [03:37:53] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 302 seconds [03:37:53] Reedy: live hacking the code not to update the list but just de-list things seem to make it unstuck...though now no high priority jobs will be done ;) [03:37:54] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [03:38:01] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 311 seconds [03:38:28] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 312 seconds [03:38:34] RECOVERY - MySQL Slave Delay on db37 is OK: OK replication delay 0 seconds [03:38:54] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds [03:38:55] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [03:39:04] RECOVERY - MySQL Slave Delay on db37 is OK: OK replication delay 0 seconds [03:39:40] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [03:40:16] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds [03:40:28] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [03:40:48] PROBLEM - 
MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 225 seconds [03:40:58] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [03:40:59] PROBLEM - MySQL Slave Delay on db68 is CRITICAL: CRIT replication delay 202 seconds [03:41:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [03:41:01] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [03:41:18] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [03:41:18] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 249 seconds [03:41:46] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 242 seconds [03:41:46] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 277 seconds [03:41:58] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [03:41:59] RECOVERY - MySQL Slave Delay on db68 is OK: OK replication delay 0 seconds [03:42:22] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [03:42:29] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [03:42:49] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [03:43:07] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [03:43:34] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [03:45:13] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [03:45:18] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [03:45:22] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [03:45:48] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [04:01:18] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 18 seconds [04:01:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:02:10] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:02:19] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:09:48] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours [04:12:13] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [04:15:23] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [04:15:53] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 212 seconds [04:15:58] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 213 seconds [04:16:07] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 217 seconds [04:16:23] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [04:19:33] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 18 seconds [04:19:34] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 27 seconds [04:19:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 10 seconds [04:19:54] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:34:38] Reedy: enotifNotify: 75 queued; 3258 acquired [04:34:42] commonswiki, lol :) [04:47:24] !log aaron synchronized php-1.21wmf8/maintenance/runJobs.php [04:47:36] Logged the message, Master [04:48:40] !log aaron synchronized 
php-1.21wmf8/maintenance/nextJobDB.php 'removed live hack' [04:48:50] Logged the message, Master [04:54:00] and back to before, meh [04:54:54] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds [04:55:23] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds [04:55:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [04:56:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds [04:57:11] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:57:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:57:54] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:57:56] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:39:38] Aaron|home: still around? (still need assistance?) [06:40:42] apergos: some jq debugging, the runners are doing nothing but high priority jobs (starving out the others) [06:41:59] huh [06:42:24] apergos: btw, can you cr https://gerrit.wikimedia.org/r/#/c/46014/2 ? [06:42:26] these will be the eqiad job runners now I guess [06:42:42] yep [06:43:00] lemme look at the gerrit commit first [06:43:19] (I'm still pretty sleepy but this seems urgent enough, not signing up for anything too huge though) [06:44:36] apergos: isn't it morning there? [06:44:48] I went to bed at about 3 am [06:45:05] it's now 8:44 but I "woke up" at 7:30 and couldn't gget back to sleep [06:54:03] New patchset: Aaron Schulz; "Fixed low priority job queue starvation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46017 [06:54:19] New review: Aaron Schulz; "Requires https://gerrit.wikimedia.org/r/#/c/46014/2" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/46017 [06:55:41] apergos: :( [06:55:49] I see it [06:55:57] I'm still looking at your change very slowly (sorry) [06:56:30] god it sucks to commit to puppet with windows [06:57:14] ugh, sorry [06:57:55] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [06:58:34] is it possible for $type to be false at the end there where you $this->output( $db . " " . $type . "\n" ); even though $types was set? [07:02:49] also (loking at second changeset) is fixdoublerediect now just ogne? [07:03:42] I can't see that being possible [07:04:05] apergos: those jobs were for a disabled feature [07:04:12] ok great [07:04:30] admins could have redirects fixed on page move [07:04:38] ohh [07:04:43] apparently cause problems with clever vandalism :( [07:04:51] dang! [07:04:57] this was a long time ago [07:04:57] clever vandals are clever [07:07:41] timestamp lt started? when would that happen? 
[07:08:25] sanity, in case the clock jumps way back
[07:08:35] should never be hit
[07:08:36] heh
[07:08:46] I guessed it was something like that, just making sure
[07:10:26] seems ok (tell me you tested these locally please though)
[07:10:59] Aaron|home:
[07:11:00] the nextJobDB one yes
[07:11:13] I tested bits of bash for the other one, but not the whole thing
[07:11:43] well we'll find out by live testing
[07:12:03] keep in mind the queue is basically totally broken now
[07:12:08] yep
[07:12:18] so it if it breaks and has to be fixed for a few min, it's not much worse
[07:12:19] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46017
[07:12:34] apergos: hold off on a puppet run
[07:12:54] oh?
[07:13:13] * apergos pauses on sockpuppet... merge or no merge?
[07:13:34] Aaron|home:
[07:14:13] apergos: I need to deploy one thing first
[07:14:36] ok
[07:15:55] oh, the one change was in core of course
[07:16:03] * apergos is sleepy, did I mention that? :-P
[07:16:04] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'deployed f0f869dfabc1cedcedd5df972fbe6c279789a498 '
[07:16:16] Logged the message, Master
[07:16:57] set?
[07:17:02] apergos: ok
[07:17:22] done
[07:17:24] lemme find a job runner to experiment on
[07:18:25] still the old code
[07:18:44] er?
[07:19:16] ok
[07:20:28] anyway the php part looks fine
[07:21:01] well here I am on mw1014
[07:21:26] with the new script and not seeing any job run via ps
[07:21:38] do you see any output in the jub runner log from mw1014?
[07:21:43] *job
[07:22:16] sec
[07:22:41] apergos: no, just silly bash errors
[07:22:48] ugh
[07:22:50] about what I expected ;)
[07:22:53] figurs
[07:23:06] well since you're looking at em wanna fix em? :-D
[07:23:51] apergos: sure
[07:24:09] * Aaron|home removes "local"
[07:25:09] :-D
[07:25:13] ah yeah I read right past that
[07:26:09] while [ "$morehpjobs" -eq "y" ]; do
[07:26:29] so that works better when things are integers
[07:26:44] I can never remember which are string comparisons and which are numerical
[07:26:48] I always have to look it up
[07:27:00] (this is why I asked if you'd tested *cough*)
[07:27:06] * Aaron|home does some local git bash testing
[07:27:27] apergos: I didn't test *those* bits just the stuff that seemed iffy, like array stuff and such :)
[07:27:28] (really I always steal from my old bash scripts is how it goes)
[07:27:31] hahaha
[07:28:51] while [[ "$morehpjobs" == "y" ]]; do
[07:29:00] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:02] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[07:29:08] apergos: can't forget [ vs [[ :)
[07:29:27] * Aaron|home readies to commit
[07:31:31] * apergos waits
[07:32:17] apergos: you might have to make the changes
[07:32:26] why is that?
[07:32:29] my repo is getting more fucked up
[07:33:10] reset it or something
[07:33:47] * Aaron|home tries again
[07:33:57] I did reset before, gives an error
[07:34:02] sort of goes through though
[07:34:05] ugh no way
[07:34:51] I think it worked
[07:35:10] New patchset: Aaron Schulz; "Fixed stupid bugs with the last commit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46019
[07:35:14] apergos: ^
[07:35:23] yay
[07:35:30] just getting ready to edit
[07:35:40] the problem for me is files/ssl/*.wikimedia.org.crt
[07:35:47] windows does not allow * in files
[07:36:01] oohhh
[07:36:06] so I can't use commit -a, since it will add the deletion of that file
[07:36:13] oh boy
[07:36:17] so I have to use git add on the file I want and commit just that stuff
[07:36:21] yep
[07:36:23] and git review has to have -R
[07:36:29] don't use git review!!
[07:36:29] or the rebase will fail and it will abort
[07:36:30] :-P
[07:37:54] * Aaron|home is on mw1014
[07:38:38] so that -lt for the timestamps... gonna work or not?
[07:39:00] Aaron|home:
[07:39:09] well those actually are integers
[07:39:22] well they are actually strings
[07:39:47] hrm
[07:40:00] $ if [ "$timestamp" -lt "$started" ]; then echo 1; fi
[07:40:02] 1
[07:40:24] ok great
[07:40:38] apergos: yeah that is wonk
[07:40:39] that's proof enough for me
[07:40:52] that should print nothing
[07:41:02] * Aaron|home removes the quotes
[07:41:06] oh you hadn't asssigned values?
[07:42:04] I did
[07:42:16] oh wait
[07:42:40] nevermind, one of them was not assigned
[07:43:03] apergos: I guess that is fine then
[07:43:15] ok good (I tested while you were looking at it)
[07:43:40] New review: ArielGlenn; "sleepy reviewer is sleepy" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/46019
[07:43:40] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46019
[07:44:33] wait for it
[07:45:21] something is running
[07:45:32] apergos: does bash remind you of perl a little?
[07:45:43] not so much
[07:45:59] I do write random perl python and bash commands together in any script I write though
[07:46:10] not so much php, that tends not to creep in
[07:46:23] maybe it's just the fact that it gets highlighted the same with my editors and looks messy in the same ways
[07:46:55] hahaha
[07:48:08] how long til we should see some refreshlinks jobs showing up?
[07:48:15] ok, so it's faster though the bug that lead to the starvation still feels like its there, but at least the time limit will be hit and some other jobs will get done
[07:48:31] like 15 min ;)
[07:48:35] ah
[07:48:41] well not going to wait around for that
[07:48:59] the rest of em will be ok within the next half hour then
[07:49:41] apergos: what if all the bugs are in that part of the code? :)
[07:50:13] you'll be cursing either me or yourself later :-D
[07:50:18] or both!
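The bash problems being fixed above boil down to three things: `local` is only valid inside a function, `-eq`/`-lt` are integer comparisons (so testing a string like "y" with `-eq` errors out), and string comparison is safest as `==` inside `[[ ]]`. A small self-contained illustration of those points, reusing the variable names from the snippets quoted above:

```bash
#!/bin/bash
# local is only legal inside a function; at top level it is an error
# ("local: can only be used in a function").
next_type() {
    local morehpjobs="y"    # fine here, would fail outside a function
    echo "$morehpjobs"
}

morehpjobs=$(next_type)

# Wrong: -eq is an integer comparison, so comparing against "y" fails
# with "integer expression expected".
# while [ "$morehpjobs" -eq "y" ]; do ... done

# Right: string comparison with == inside [[ ]].
while [[ "$morehpjobs" == "y" ]]; do
    echo "running high-priority jobs..."
    morehpjobs="n"
done

# Numeric -lt does work on timestamp strings like 20130127071000,
# because bash treats them as (large) integers.
started=20130127070000
timestamp=20130127071000
if [ "$timestamp" -lt "$started" ]; then
    echo "clock went backwards?"   # sanity check; should normally not print
fi
```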
[07:50:37] * apergos runs puppetd on a couple more job runners out of impatience anyways
[07:51:13] I still think the db delisting in nextJobDB does not work sometimes
[07:51:31] swesme
[07:52:01] but not always...I tested disabled refreshes of pending list and just let things get delisted and eventually the runners broke out of the high priority loop because those dbs where removed from the list since their queues were empty
[07:52:25] but it's like it's not happening always so runners keep picking the same empty queues
[07:52:37] irritating
[07:52:42] seems like it started around the wmf8 deploy on the 23rd
[07:52:59] * Aaron|home has been starring at the code changes there
[07:53:09] what changed there, I was gonna ask
[07:53:27] like srwiki and uzwiki keep getting pulled in jobs-loop
[07:53:33] * Aaron|home watches the terminal
[07:53:43] nothing gets run, so you'd think they'd get delisted
[07:53:50] yep
[07:54:24] sometimes the same wiki gets hit twice in a row
[07:54:39] if it's still got stuff, fine
[07:54:43] it could be races sometimes but not so often like this
[07:55:21] generally, if you call nextJobDB and run that wiki, there should be jobs run (aside from sometimes getting raced out be another runner)
[07:55:37] right
[07:56:00] but it will only run some of them, there can be more left
[07:56:06] the other possibility was a queue isEmpty() function returning false for now reason
[07:56:10] but you mean twice in a row when there is nothing run?
[07:56:14] I could not see any of that
[07:56:32] apergos: yes
[07:56:41] that's bad all right
[07:58:16] apergos: do you see anything in nextJobDB that might cause that?
[07:58:24] lemme look
[07:59:28] it's not like it never happens, I know it does delist, but something is off about it
[08:02:11] now it's doing refreshLinks
[08:02:32] I guess that works then
[08:02:42] isEmpty looks pretty straightforward
[08:02:46] apergos: is it on all runners now?
[08:02:49] not much of a way for it to go awry
[08:03:07] might be 1 or 2 that have a couple minutes before the next puppet but the vast moajority have it
[08:03:58] * apergos stares at delistDB some
[08:04:49] also not that pop() sets the cache key to true if it finds nothing, which is what must be happening when nothing is run
[08:04:52] *note
[08:05:32] getCacheKey() is properly prefixed
[08:08:19] the way to find out if it is really truly broken is to have it log when it gets into this section
[08:08:22] if ( !$this->checkJob( $type, $db ) ) {
[08:09:02] 'delisting blah' or whatever
[08:09:16] heh, I did that in temp-debug.log
[08:09:23] ah hah
[08:09:23] it had a stream of entries
[08:09:34] even for the case when it could get the lock
[08:09:50] and I was var_dump()'ing the keys in eval.php
[08:10:07] mmm
[08:10:17] the problem is not contention (which never showed in the log)
[08:10:26] nor the delisting never happening
[08:10:32] but clearly something is wonk
[08:10:35] no we know it happens sometimes, like you say
[08:11:31] uugghh
[08:11:45] I have to eat fast (which is impossible for me) and then get going, I have to be somewhere in 15 mins :-(
[08:12:04] are things unbroken enough that I can leave this?
[08:12:15] apergos: it's a little better than before
[08:12:17] I mean if it is running the refreshlinks every so often at least
[08:12:19] I guess it's fine
[08:12:36] you could shorten that interval if you wanted, I dunno
[08:12:46] apergos: as a hack, maybe
[08:13:00] it could work well
[08:13:31] if you do that I'll stick around long enough to push that out
[08:13:41] * apergos looks for food that can be eaten fast
[08:14:08] found: tea biscuits and nutella :-P
[08:14:57] ok
[08:16:34] apergos: what is a good lpmaxdelay value? 1 minute?
[08:17:06] hmm longer I woud think
[08:17:14] no?
[08:17:36] I woulda given like 3 mins
[08:17:49] well hpmaxdelay is 3 min
[08:17:53] oh woops
[08:17:55] ugh
[08:18:17] so it would be 1:3 then
[08:18:43] that time it spends usefully for enotifNotify is actually quite short
[08:18:52] ok
[08:18:58] well let's try it and see how it looks
[08:19:07] guess I'll stick around a bit longer :-D
[08:20:09] New patchset: Aaron Schulz; "Temporary hack to run low-priority jobs more." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46020
[08:20:23] apergos: ^
[08:20:44] I see it
[08:21:12] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46020
[08:21:55] * Aaron|home wanted to play simcity today (really yesterday technically)
[08:22:06] oh, well, next time
[08:22:14] (the closed beta)
[08:22:39] rats
[08:23:21] wiki=nowiki always freaks me out a bit
[08:24:14] 'm a bit baffled about what's being run now
[08:24:17] (mw1014 again)
[08:25:45] do we have a script that will give us a snapshot of the state of the job queue (which dbs have how many of what pending)?
[08:26:42] you could do mwscriptwikiset showJobs.php all.dblist --group
[08:27:18] * apergos maks a note of that
[08:28:23] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:28:32] well anyways it is running both types of jobs frequently for now
[08:30:54] apergos: I think things will be okish for now, thanks!
[08:31:12] thank you
[08:31:48] see you later!
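What the hpmaxdelay/lpmaxdelay discussion above amounts to: a runner should only spend a bounded slice of time on the high-priority job types before being forced to pick up everything else, so types like refreshLinks cannot be starved indefinitely. A much-simplified sketch of that alternation, not the real jobs-loop script; run_hp_jobs and run_other_jobs are hypothetical stand-ins, and the 180/60 second values mirror the 3:1 ratio settled on above:

```bash
#!/bin/bash
# Simplified sketch of the high/low priority alternation discussed above.
# The two helpers are hypothetical placeholders for "run one batch of
# high-priority jobs" / "run one batch of whatever else is queued"; they
# return non-zero when nothing was run.
run_hp_jobs()    { false; }
run_other_jobs() { false; }

hpmaxdelay=180   # max seconds spent on high-priority types (3 min)
lpmaxdelay=60    # guaranteed slice for everything else (1 min)

while true; do
    # Phase 1: high-priority job types, but only until hpmaxdelay is used up.
    started=$(date +%s)
    while (( $(date +%s) - started < hpmaxdelay )); do
        run_hp_jobs || break          # nothing queued: stop early
    done

    # Phase 2: the rest of the queue (refreshLinks, enotifNotify, ...)
    # gets its own bounded slice, so it cannot be starved indefinitely.
    started=$(date +%s)
    while (( $(date +%s) - started < lpmaxdelay )); do
        run_other_jobs || break
    done

    sleep 5                           # idle briefly when both queues are empty
done
```

Queue state across all wikis can be inspected with the command quoted above, `mwscriptwikiset showJobs.php all.dblist --group`, whose per-type output looks like the "enotifNotify: 75 queued; 3258 acquired" line pasted earlier in the log.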
[08:32:44] bye [08:33:38] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:10:05] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Sun Jan 27 09:09:38 UTC 2013 [09:45:22] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [09:45:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [09:46:02] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [09:46:07] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [09:51:03] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 20 seconds [09:51:13] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 20 seconds [09:51:22] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 21 seconds [09:51:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 22 seconds [10:08:52] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:17:15] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [10:19:12] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [10:22:12] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [10:23:16] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [10:23:16] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [10:24:18] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [10:25:12] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [10:36:18] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [10:39:18] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [10:39:18] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [10:42:18] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [11:16:54] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Sun Jan 27 11:16:31 UTC 2013 [12:08:02] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Sun Jan 27 12:07:59 UTC 2013 [12:09:12] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Sun Jan 27 12:09:02 UTC 2013 [12:39:32] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [12:41:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [12:58:33] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [12:59:06] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.011 second response time on port 11000 [13:43:41] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:44:17] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:44:52] New review: Hashar; "PS14: send logs to localhost" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [13:50:10] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:50:59] New review: Hashar; "PS16: 
rebased / fixed conflict" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [14:02:30] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [14:13:39] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [14:16:38] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [14:27:54] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [14:29:00] New review: Hashar; "PS18: get an API backend IP" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [16:25:25] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:26:56] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [16:58:46] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [17:30:19] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [17:30:19] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [17:30:21] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:40:59] could i bug someone for some server logs pls [18:41:43] i would like to analyze api logs, preferably for at least a few days span [18:46:26] yurik: can you give me a bit more detail? [18:47:26] request logs contain private information so we can't just release them :-/. [18:47:50] ori-l: i would like to look at the usage patterns, trying to figure out what features are used and how, which sub-requests are combined into one call, etc [18:48:20] ok, let me first take a look at what kind of log data we have [18:48:28] i have already seen the logs a while back - obviously they shouldn't be released to general public, but i guess i could sign some nondisclore if you really want :) [18:48:53] ori-l: in any way, they wouldn't really contain private info - that would be posted [18:49:41] ori-l: i used that log info before to optimize api design [18:50:27] yurik: re: having had access to the log before -- yes, I completely understand how it's silly for you not to have access to that. I'm aware of who you are :) [18:51:53] as for not containing private information: you'd be surprised. The analytics team wanted to release an anonymized log of search queries and you'd be surprised what sort of things are in there -- sometimes people just paste their clipboard because they think one thing is in it but in fact it is something else [18:52:15] * yurik is riding the fame wave... 
hoping that wave is mostly good clean water, not something smelly :))) [18:52:31] ori-l: understood, won't disclose anything i see [18:52:33] Copy/pasting is not an issue with the API, but I'm just citing it as an example of data you wouldn't expect to contain private data [18:52:55] i mean - won't disclose anything private like that [18:53:03] obviously will publish my general findings [19:05:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [19:05:49] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [19:09:37] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [19:13:02] New review: Hashar; "Ok I managed to get a hit on the backend Apaches!" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [19:17:26] New patchset: Vogone; "Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [19:38:05] New patchset: Vogone; "Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [19:43:59] New review: Alex Monk; "Bug 44411 - shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/46187 [19:51:10] Could ops type folks look into bug 41130 (images not being purged. Most reports from people accessing via europe)? There is a lot of user frustration building about the issue [19:57:16] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.73590942029 (gt 8.0) [20:02:32] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [20:02:42] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [20:03:42] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [20:04:52] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [20:06:34] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [20:11:33] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 12 seconds [20:11:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 15 seconds [20:12:51] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:13:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:14:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:16:03] New patchset: Hashar; "link android-commons nightly apps on CI portal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46188 [20:18:25] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [20:20:21] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [20:22:09] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.008 second response time on port 11000 [20:23:00] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [20:23:21] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [20:24:24] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [20:24:25] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [20:25:27] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [20:26:21] PROBLEM - Puppet 
freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [20:37:27] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [20:40:27] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [20:40:28] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [20:43:27] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [20:46:47] New patchset: Vogone; "(bug 44411) Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [20:53:39] * Jasper_Deng wonders if anyone wants to merge the change for https://bugzilla.wikimedia.org/show_bug.cgi?id=44395 [21:57:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:57:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:11:44] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0 [22:25:14] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 16.0005276259 (gt 8.0) [22:35:25] New review: Hoo man; "Change is trivial and consensus is given, please merge" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/46187 [22:43:25] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0 [22:52:32] New patchset: Diederik; "Enable sampling on Thank_You_Main and country filter on Emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46189 [22:53:59] any ops around? [22:54:49] LeslieCarr: ping, in case you're on RT duty [22:55:54] anyways, if anyone from ops sees this: https://gerrit.wikimedia.org/r/46189 (change to filters in operations/puppet) should resolve packet loss issues on emery. fundraising forgot to sample. [22:56:56] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 10.6844683942 (gt 8.0) [22:58:59] ^^ case in point :-/ [22:59:32] :D [23:04:48] hey [23:05:38] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:05:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [23:06:08] New review: Faidon; "Supposed to solve a 10-15% packet loss on emery. I'll take it." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/46189 [23:06:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46189 [23:06:56] hey paravoid [23:07:00] ty! [23:07:23] running puppet on emery [23:09:26] done [23:09:47] awesome, let's hope that nagios tells us that the problem has been resolved [23:09:59] paravoid: woops, I tabbed out of IRC and didn't see that you replied [23:10:04] I just e-mailed operations [23:10:17] Would you like to reply or should I? [23:11:01] !log enabled sampling on Thank_You_Main and a country filter on Emery [23:11:03] done [23:11:14] Logged the message, Master [23:12:11] paravoid: those processes are still running, you might have to restart udp2log [23:13:07] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0 [23:13:15] i think it's okay: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&v=0.0&m=packet_loss_average&r=hour&z=default&jr=&js=&st=1359324678&vl=%25&z=large [23:13:36] OK, cool [23:13:50] restart what? 
[23:14:24] i think we are set [23:14:30] pids 21802 and 21797 are still the same as before, but maybe they don't need to be restarted if it's simply udplog that pipes less data into them [23:14:46] I'm not sure what the forking / piping behavior is [23:15:32] they're each still taking up 100% of a core but it seems like it's periodically dropping to values in the 90%s so maybe it went from not keeping up to barely keeping up [23:15:41] while not forking { fork } [23:16:10] * bork [23:18:15] Is JeLuF still around at all? [23:24:34] !log reedy synchronized php-1.21wmf8/extensions/CentralAuth/ [23:24:45] Logged the message, Master [23:50:05] !log reedy synchronized php-1.21wmf7/extensions/CentralAuth/ [23:50:17] Logged the message, Master
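The emery packet loss at the end of the day was resolved by enabling sampling on the Thank_You_Main filter and adding a country filter (gerrit change 46189, merged above), so the consumer only has to handle a fraction of the incoming lines. A toy 1-in-N sampler illustrating the idea; the factor of 100 is an arbitrary example, and this is not the actual udp2log filter that was changed:

```bash
#!/bin/bash
# Toy 1-in-N sampler for a log stream on stdin. Keeping only every Nth
# line is what enabling sampling buys: the downstream filter has far less
# to process, so it stops falling behind and dropping packets. N=100 is
# an arbitrary example value, not the factor used in the real change.
N=${1:-100}
exec awk -v n="$N" 'NR % n == 0'
```

Illustrative usage would be something like `cat thank_you.log | ./sample.sh 100 > thank_you_sample.log`; the real fix lived in the puppet-managed udp2log filter configuration on emery, per the log message and gerrit link above.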