[00:08:21] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Sun Jan 27 00:08:07 UTC 2013 [00:08:46] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:16:07] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [00:18:04] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [00:18:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 195 seconds [00:19:12] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 239 seconds [00:19:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 252 seconds [00:20:09] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:20:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [00:20:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [00:21:08] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [00:22:20] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [00:22:21] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [00:23:23] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [00:24:26] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [00:35:26] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [00:38:23] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [00:38:23] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [00:41:23] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [00:55:10] !log core dumping cr1-sdtpa - 1% chance of minor network issue [00:55:24] Logged the message, Mistress of the network gear. 
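Context for the stream of alerts above: the "Puppet freshness" check goes CRITICAL when a host has not completed a Puppet run in the last 10 hours and recovers as soon as a run finishes. A minimal sketch of that kind of check, not the actual Icinga plugin; the state-file path and the use of its mtime are assumptions:

```bash
#!/bin/bash
# Sketch of a "Puppet freshness" check: CRITICAL if the last Puppet run
# finished more than 10 hours ago. Assumes the agent touches this state
# file at the end of every run (path is an assumption).
STATE_FILE=/var/lib/puppet/state/last_run_summary.yaml
MAX_AGE=$((10 * 3600))   # 10 hours, matching the alert text

now=$(date +%s)
last_run=$(stat -c %Y "$STATE_FILE" 2>/dev/null || echo 0)
age=$((now - last_run))

if [ "$age" -gt "$MAX_AGE" ]; then
    echo "CRITICAL: Puppet has not run in the last 10 hours"
    exit 2
else
    echo "OK: puppet ran at $(date -u -d "@$last_run")"
    exit 0
fi
```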
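Similarly, the MySQL Slave Delay / Replication Heartbeat alerts in the same stream track replication lag and, judging by the messages, trip somewhere above ~180 seconds and recover near 0. A rough sketch of a lag probe built on SHOW SLAVE STATUS; the 180-second threshold, host argument, and reliance on credentials from ~/.my.cnf are assumptions, not the production check:

```bash
#!/bin/bash
# Rough sketch of a replication delay probe. The CRIT threshold of 180s is
# inferred from the alerts above, which fire at roughly 180+ seconds.
HOST=${1:-db53}
CRIT=180

delay=$(mysql -h "$HOST" -N -e "SHOW SLAVE STATUS\G" \
        | awk '/Seconds_Behind_Master/ {print $2}')

if [ -z "$delay" ] || [ "$delay" = "NULL" ]; then
    echo "CRITICAL: replication is not running on $HOST"
    exit 2
elif [ "$delay" -ge "$CRIT" ]; then
    echo "CRIT replication delay $delay seconds"
    exit 2
else
    echo "OK replication delay $delay seconds"
    exit 0
fi
```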
[00:59:59] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::0 [01:00:01] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::12 [01:00:01] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::4 [01:00:08] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::6 [01:00:09] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [01:00:09] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3 [01:00:09] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:17] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::5 [01:00:18] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [01:00:26] PROBLEM - Host wikinews-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:26] PROBLEM - Host wikiquote-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:27] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2 [01:00:27] PROBLEM - Host wikiversity-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:28] PROBLEM - Host wiktionary-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:00:44] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::9 [01:00:45] PROBLEM - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::12 [01:00:45] mediawiki.org won't load [01:00:45] PROBLEM - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::13 [01:00:53] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::a [01:00:54] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::8 [01:00:55] shit [01:00:55] ok [01:01:02] PROBLEM - Host bits-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:04] en wiki is offline too [01:01:11] PROBLEM - Host foundation-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.233) [01:01:12] PROBLEM - Host mediawiki-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.232) [01:01:12] PROBLEM - Host mobile-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:15] loads for me [01:01:21] i deactivated pim - reactivating pim now [01:01:29] techman224: are you in Europe or not europe ? 
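The IPv6 host checks above are plain ping probes against the load-balanced service addresses, and each alert quotes the exact command line that was run. A small wrapper built around that same command; only the argument handling and the UP/DOWN echo strings are added here:

```bash
#!/bin/bash
# Ping an IPv6 LVS service address the same way the alerts above do:
# 5 packets, 15-second deadline. The default address is one quoted in the
# log (wikipedia-lb.esams).
ADDR=${1:-2620:0:862:ed1a::1}

if /bin/ping6 -n -U -w 15 -c 5 "$ADDR" > /dev/null 2>&1; then
    echo "UP: PING OK to $ADDR"
    exit 0
else
    echo "DOWN: no reply from $ADDR"
    exit 2
fi
```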
[01:01:38] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::b [01:01:39] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::b [01:01:39] PROBLEM - Host m.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:40] North america [01:01:47] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::12 [01:01:48] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::4 [01:01:49] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::0 [01:01:50] uhoh, can you please give me a traceroute [01:01:54] PROBLEM - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::13 [01:01:56] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [01:01:57] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::6 [01:01:57] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3 [01:01:58] PROBLEM - Host upload-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:01:59] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:01:59] PROBLEM - Host upload-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:02:04] :( [01:02:05] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [01:02:06] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::5 [01:02:06] PROBLEM - Host wikivoyage-lb.eqiad.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::13 [01:02:06] PROBLEM - Host wikisource-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.229) [01:02:07] PROBLEM - Host wikipedia-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.225) [01:02:07] PROBLEM - Host wikibooks-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:08] PROBLEM - Host wikimedia-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:02:08] PROBLEM - Host wikimedia-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:09] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:15] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2 [01:02:31] * Jasper_Deng is having no problems [01:02:33] PROBLEM - Host wiktionary-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:34] PROBLEM - Host wikisource-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:35] PROBLEM - Host wikipedia-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:35] PROBLEM - Host wikiversity-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:36] PROBLEM - Host wikivoyage-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:36] PROBLEM - Host wikinews-lb.eqiad.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [01:02:37] PROBLEM - Host wikiquote-lb.eqiad.wikimedia.org_https is 
DOWN: PING CRITICAL - Packet loss = 100% [01:02:41] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::a [01:02:50] !log rpd core caused bgp to restart (even though it wasn't supposed to) - causing routing churn [01:02:51] The whole cluster is going down [01:02:56] Logged the message, Mistress of the network gear. [01:02:59] PROBLEM - Host bits-lb.eqiad.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (208.80.154.234) [01:03:06] techman224/ Jasper_Deng this makes sense in that we withdrew and readvertised some routes - causing churn [01:03:12] this would make it so that some are having issues [01:03:17] techman224: can you send me a traceroute please ? [01:03:24] LeslieCarr, still running [01:03:27] that explains some - I think my tunnelbroker caches routes [01:03:49] I'm getting a bunch from us.above.net [01:03:58] RobH: real alarm but my fault [01:04:14] RECOVERY - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [01:04:16] LeslieCarr: ahh, so i dont need to do anything? [01:04:19] awesome =] [01:04:20] It's back up [01:04:20] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [01:04:21] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.40 ms [01:04:21] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.04 ms [01:04:24] RECOVERY - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [01:04:29] RECOVERY - Host mobile-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 27.17 ms [01:04:30] RECOVERY - Host mediawiki-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [01:04:31] RECOVERY - Host foundation-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 27.19 ms [01:04:32] no need to do anything RobH [01:04:37] en.wikipedia.org load normally [01:04:38] RECOVERY - Host upload-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:04:41] RECOVERY - Host bits-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:04:41] RECOVERY - Host m.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [01:04:41] whew [01:04:41] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.33 ms [01:04:46] oh well, no one can say im not answering pages =] [01:04:48] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.40 ms [01:04:49] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.21 ms [01:04:56] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [01:04:57] RECOVERY - Host wikiquote-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [01:04:57] RECOVERY - Host wikiversity-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [01:04:57] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.05 ms [01:04:58] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.05 ms [01:05:05] RECOVERY - Host wikinews-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [01:05:06] thanks techman224 and Jasper_Deng for being so quick on this and RobH for jumping online so quickly [01:05:06] RECOVERY - Host 
wikibooks-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:05:06] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.14 ms [01:05:06] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.37 ms [01:05:07] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.33 ms [01:05:07] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.06 ms [01:05:08] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 113.10 ms [01:05:14] RECOVERY - Host wiktionary-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [01:05:23] RECOVERY - Host wikipedia-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [01:05:24] RECOVERY - Host wikisource-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [01:05:26] Participation reward, yay! [01:05:50] RECOVERY - Host wikimedia-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [01:05:57] RobH: being at the right place at the right time ;) [01:06:57] LeslieCarr: yw - though I didn't actually experience any issues [01:07:04] RECOVERY - Host wikivoyage-lb.pmtpa.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [01:07:18] Jasper_Deng, are you tunnelling through labs? [01:07:28] techman224: no - I'm using IPv6 tunnel [01:07:29] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.12 ms [01:07:37] which apparently cached the routes [01:07:38] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [01:07:38] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.29 ms [01:07:39] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.36 ms [01:07:47] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.04 ms [01:07:48] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.06 ms [01:07:48] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.12 ms [01:07:48] RECOVERY - Host upload-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [01:07:56] RECOVERY - Host wikivoyage-lb.eqiad.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms [01:07:57] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.12 ms [01:07:57] RECOVERY - Host wikibooks-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [01:07:58] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.02 ms [01:07:58] RECOVERY - Host wikimedia-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [01:07:58] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [01:08:05] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.07 ms [01:08:23] RECOVERY - Host wikisource-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [01:08:24] RECOVERY - Host wiktionary-lb.eqiad.wikimedia.org_https is UP: PING OK - 
Packet loss = 0%, RTA = 26.47 ms [01:08:24] RECOVERY - Host wikipedia-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [01:08:24] RECOVERY - Host wikiversity-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [01:08:25] RECOVERY - Host wikivoyage-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.76 ms [01:08:26] RECOVERY - Host wikinews-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 27.00 ms [01:08:32] RECOVERY - Host wikiquote-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:08:34] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.15 ms [01:08:50] RECOVERY - Host bits-lb.eqiad.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms [01:09:28] interesting factoid - that's also why icinga didn't page - nagios-wm is using cr1-sdtpa as it's gateway whereas icinga is using cr2-eqiad [01:09:33] the more you know [01:15:01] LeslieCarr: When you said 1% chance of minor, we know you all ment 99% change of taking the site down :P [01:16:30] hehe [01:30:56] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'deployed 93869b8d37b6b7999190b378cb6813f6317b0b10' [01:31:10] Logged the message, Master [01:33:07] LeslieCarr: she said 1% chance of minor problems, not the chance of major problems [01:33:20] that number was not specified [01:33:38] still where would the other 99% go? [01:33:43] gah, that was meant for Damianz [01:34:12] Jasper_Deng: to an occupy protest? [01:34:23] lol [01:34:33] * Damianz protests about Aaron|home keeping him awake at 2am [01:43:28] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours [01:58:24] New patchset: Techman224; "(bug 44395) Allow bureaucrats to remove the translateadmin group on wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46011 [02:26:58] !log LocalisationUpdate completed (1.21wmf8) at Sun Jan 27 02:26:57 UTC 2013 [02:27:10] Logged the message, Master [02:48:47] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'quick debugging' [02:48:58] Logged the message, Master [02:50:58] * Aaron|home wonders if apergos is around [02:51:06] !log LocalisationUpdate completed (1.21wmf7) at Sun Jan 27 02:51:05 UTC 2013 [02:51:17] Logged the message, Master [02:55:45] !log aaron synchronized php-1.21wmf8/maintenance/runJobs.php [02:55:56] Logged the message, Master [03:25:35] 4:50 am local time? 
I bet not ;) [03:31:11] * Aaron|home is still baffled [03:31:17] Reedy: while you are here https://gerrit.wikimedia.org/r/#/c/46012/ [03:35:14] Reedy: and https://gerrit.wikimedia.org/r/#/c/46009/ [03:35:45] still don't know what the main problem is that causes dbname delisting from cache value to take way too long [03:35:54] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 184 seconds [03:35:55] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 185 seconds [03:36:04] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 193 seconds [03:36:05] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 194 seconds [03:36:05] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 196 seconds [03:36:09] the runners get stuck faffing around the high priority jobs and never doing anything else [03:36:24] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 215 seconds [03:36:25] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 215 seconds [03:36:34] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 223 seconds [03:36:35] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 229 seconds [03:36:35] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 230 seconds [03:36:44] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 239 seconds [03:36:58] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 249 seconds [03:37:04] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [03:37:05] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:37:07] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 257 seconds [03:37:25] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [03:37:25] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [03:37:25] PROBLEM - MySQL Slave Delay on db37 is CRITICAL: CRIT replication delay 248 seconds [03:37:52] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 302 seconds [03:37:53] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 302 seconds [03:37:53] Reedy: live hacking the code not to update the list but just de-list things seem to make it unstuck...though now no high priority jobs will be done ;) [03:37:54] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [03:38:01] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 311 seconds [03:38:28] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 312 seconds [03:38:34] RECOVERY - MySQL Slave Delay on db37 is OK: OK replication delay 0 seconds [03:38:54] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds [03:38:55] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [03:39:04] RECOVERY - MySQL Slave Delay on db37 is OK: OK replication delay 0 seconds [03:39:40] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [03:40:16] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds [03:40:28] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [03:40:48] PROBLEM - 
MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 225 seconds [03:40:58] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [03:40:59] PROBLEM - MySQL Slave Delay on db68 is CRITICAL: CRIT replication delay 202 seconds [03:41:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [03:41:01] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [03:41:18] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [03:41:18] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 249 seconds [03:41:46] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 242 seconds [03:41:46] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 277 seconds [03:41:58] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [03:41:59] RECOVERY - MySQL Slave Delay on db68 is OK: OK replication delay 0 seconds [03:42:22] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [03:42:29] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [03:42:49] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [03:43:07] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [03:43:34] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [03:45:13] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [03:45:18] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [03:45:22] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [03:45:48] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [04:01:18] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 18 seconds [04:01:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:02:10] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:02:19] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:09:48] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours [04:12:13] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [04:15:23] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [04:15:53] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 212 seconds [04:15:58] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 213 seconds [04:16:07] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 217 seconds [04:16:23] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [04:19:33] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 18 seconds [04:19:34] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 27 seconds [04:19:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 10 seconds [04:19:54] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:34:38] Reedy: enotifNotify: 75 queued; 3258 acquired [04:34:42] commonswiki, lol :) [04:47:24] !log aaron synchronized php-1.21wmf8/maintenance/runJobs.php [04:47:36] Logged the message, Master [04:48:40] !log aaron synchronized 
php-1.21wmf8/maintenance/nextJobDB.php 'removed live hack' [04:48:50] Logged the message, Master [04:54:00] and back to before, meh [04:54:54] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds [04:55:23] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds [04:55:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [04:56:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds [04:57:11] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:57:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [04:57:54] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [04:57:56] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:39:38] Aaron|home: still around? (still need assistance?) [06:40:42] apergos: some jq debugging, the runners are doing nothing but high priority jobs (starving out the others) [06:41:59] huh [06:42:24] apergos: btw, can you cr https://gerrit.wikimedia.org/r/#/c/46014/2 ? [06:42:26] these will be the eqiad job runners now I guess [06:42:42] yep [06:43:00] lemme look at the gerrit commit first [06:43:19] (I'm still pretty sleepy but this seems urgent enough, not signing up for anything too huge though) [06:44:36] apergos: isn't it morning there? [06:44:48] I went to bed at about 3 am [06:45:05] it's now 8:44 but I "woke up" at 7:30 and couldn't gget back to sleep [06:54:03] New patchset: Aaron Schulz; "Fixed low priority job queue starvation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46017 [06:54:19] New review: Aaron Schulz; "Requires https://gerrit.wikimedia.org/r/#/c/46014/2" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/46017 [06:55:41] apergos: :( [06:55:49] I see it [06:55:57] I'm still looking at your change very slowly (sorry) [06:56:30] god it sucks to commit to puppet with windows [06:57:14] ugh, sorry [06:57:55] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [06:58:34] is it possible for $type to be false at the end there where you $this->output( $db . " " . $type . "\n" ); even though $types was set? [07:02:49] also (loking at second changeset) is fixdoublerediect now just ogne? [07:03:42] I can't see that being possible [07:04:05] apergos: those jobs were for a disabled feature [07:04:12] ok great [07:04:30] admins could have redirects fixed on page move [07:04:38] ohh [07:04:43] apparently cause problems with clever vandalism :( [07:04:51] dang! [07:04:57] this was a long time ago [07:04:57] clever vandals are clever [07:07:41] timestamp lt started? when would that happen? 
[07:08:25] sanity, in case the clock jumps way back
[07:08:35] should never be hit
[07:08:36] heh
[07:08:46] I guessed it was something like that, just making sure
[07:10:26] seems ok (tell me you tested these locally please though)
[07:10:59] Aaron|home:
[07:11:00] the nextJobDB one yes
[07:11:13] I tested bits of bash for the other one, but not the whole thing
[07:11:43] well we'll find out by live testing
[07:12:03] keep in mind the queue is basically totally broken now
[07:12:08] yep
[07:12:18] so it if it breaks and has to be fixed for a few min, it's not much worse
[07:12:19] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46017
[07:12:34] apergos: hold off on a puppet run
[07:12:54] oh?
[07:13:13] * apergos pauses on sockpuppet... merge or no merge?
[07:13:34] Aaron|home:
[07:14:13] apergos: I need to deploy one thing first
[07:14:36] ok
[07:15:55] oh, the one change was in core of course
[07:16:03] * apergos is sleepy, did I mention that? :-P
[07:16:04] !log aaron synchronized php-1.21wmf8/maintenance/nextJobDB.php 'deployed f0f869dfabc1cedcedd5df972fbe6c279789a498 '
[07:16:16] Logged the message, Master
[07:16:57] set?
[07:17:02] apergos: ok
[07:17:22] done
[07:17:24] lemme find a job runner to experiment on
[07:18:25] still the old code
[07:18:44] er?
[07:19:16] ok
[07:20:28] anyway the php part looks fine
[07:21:01] well here I am on mw1014
[07:21:26] with the new script and not seeing any job run via ps
[07:21:38] do you see any output in the jub runner log from mw1014?
[07:21:43] *job
[07:22:16] sec
[07:22:41] apergos: no, just silly bash errors
[07:22:48] ugh
[07:22:50] about what I expected ;)
[07:22:53] figurs
[07:23:06] well since you're looking at em wanna fix em? :-D
[07:23:51] apergos: sure
[07:24:09] * Aaron|home removes "local"
[07:25:09] :-D
[07:25:13] ah yeah I read right past that
[07:26:09] while [ "$morehpjobs" -eq "y" ]; do
[07:26:29] so that works better when things are integers
[07:26:44] I can never remember which are string comparisons and which are numerical
[07:26:48] I always have to look it up
[07:27:00] (this is why I asked if you'd tested *cough*)
[07:27:06] * Aaron|home does some local git bash testing
[07:27:27] apergos: I didn't test *those* bits just the stuff that seemed iffy, like array stuff and such :)
[07:27:28] (really I always steal from my old bash scripts is how it goes)
[07:27:31] hahaha
[07:28:51] while [[ "$morehpjobs" == "y" ]]; do
[07:29:00] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:29:02] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[07:29:08] apergos: can't forget [ vs [[ :)
[07:29:27] * Aaron|home readies to commit
[07:31:31] * apergos waits
[07:32:17] apergos: you might have to make the changes
[07:32:26] why is that?
[07:32:29] my repo is getting more fucked up
[07:33:10] reset it or something
[07:33:47] * Aaron|home tries again
[07:33:57] I did reset before, gives an error
[07:34:02] sort of goes through though
[07:34:05] ugh no way
[07:34:51] I think it worked
[07:35:10] New patchset: Aaron Schulz; "Fixed stupid bugs with the last commit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46019
[07:35:14] apergos: ^
[07:35:23] yay
[07:35:30] just getting ready to edit
[07:35:40] the problem for me is files/ssl/*.wikimedia.org.crt
[07:35:47] windows does not allow * in files
[07:36:01] oohhh
[07:36:06] so I can't use commit -a, since it will add the deletion of that file
[07:36:13] oh boy
[07:36:17] so I have to use git add on the file I want and commit just that stuff
[07:36:21] yep
[07:36:23] and git review has to have -R
[07:36:29] don't use git review!!
[07:36:29] or the rebase will fail and it will abort
[07:36:30] :-P
[07:37:54] * Aaron|home is on mw1014
[07:38:38] so that -lt for the timestamps... gonna work or not?
[07:39:00] Aaron|home:
[07:39:09] well those actually are integers
[07:39:22] well they are actually strings
[07:39:47] hrm
[07:40:00] $ if [ "$timestamp" -lt "$started" ]; then echo 1; fi
[07:40:02] 1
[07:40:24] ok great
[07:40:38] apergos: yeah that is wonk
[07:40:39] that's proof enough for me
[07:40:52] that should print nothing
[07:41:02] * Aaron|home removes the quotes
[07:41:06] oh you hadn't asssigned values?
[07:42:04] I did
[07:42:16] oh wait
[07:42:40] nevermind, one of them was not assigned
[07:43:03] apergos: I guess that is fine then
[07:43:15] ok good (I tested while you were looking at it)
[07:43:40] New review: ArielGlenn; "sleepy reviewer is sleepy" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/46019
[07:43:40] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46019
[07:44:33] wait for it
[07:45:21] something is running
[07:45:32] apergos: does bash remind you of perl a little?
[07:45:43] not so much
[07:45:59] I do write random perl python and bash commands together in any script I write though
[07:46:10] not so much php, that tends not to creep in
[07:46:23] maybe it's just the fact that it gets highlighted the same with my editors and looks messy in the same ways
[07:46:55] hahaha
[07:48:08] how long til we should see some refreshlinks jobs showing up?
[07:48:15] ok, so it's faster though the bug that lead to the starvation still feels like its there, but at least the time limit will be hit and some other jobs will get done
[07:48:31] like 15 min ;)
[07:48:35] ah
[07:48:41] well not going to wait around for that
[07:48:59] the rest of em will be ok within the next half hour then
[07:49:41] apergos: what if all the bugs are in that part of the code? :)
[07:50:13] you'll be cursing either me or yourself later :-D
[07:50:18] or both!
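The bash problems being fixed above boil down to three things: `local` is only valid inside a function, `-eq`/`-lt` are integer comparisons (so testing a string like "y" with `-eq` errors out), and string comparison is safest as `==` inside `[[ ]]`. A small self-contained illustration of those points, reusing the variable names from the snippets quoted above:

```bash
#!/bin/bash
# local is only legal inside a function; at top level it is an error
# ("local: can only be used in a function").
next_type() {
    local morehpjobs="y"    # fine here, would fail outside a function
    echo "$morehpjobs"
}

morehpjobs=$(next_type)

# Wrong: -eq is an integer comparison, so comparing against "y" fails
# with "integer expression expected".
# while [ "$morehpjobs" -eq "y" ]; do ... done

# Right: string comparison with == inside [[ ]].
while [[ "$morehpjobs" == "y" ]]; do
    echo "running high-priority jobs..."
    morehpjobs="n"
done

# Numeric -lt does work on timestamp strings like 20130127071000,
# because bash treats them as (large) integers.
started=20130127070000
timestamp=20130127071000
if [ "$timestamp" -lt "$started" ]; then
    echo "clock went backwards?"   # sanity check; should normally not print
fi
```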
[07:50:37] * apergos runs puppetd on a couple more job runners out of impatience anyways
[07:51:13] I still think the db delisting in nextJobDB does not work sometimes
[07:51:31] swesme
[07:52:01] but not always...I tested disabled refreshes of pending list and just let things get delisted and eventually the runners broke out of the high priority loop because those dbs where removed from the list since their queues were empty
[07:52:25] but it's like it's not happening always so runners keep picking the same empty queues
[07:52:37] irritating
[07:52:42] seems like it started around the wmf8 deploy on the 23rd
[07:52:59] * Aaron|home has been starring at the code changes there
[07:53:09] what changed there, I was gonna ask
[07:53:27] like srwiki and uzwiki keep getting pulled in jobs-loop
[07:53:33] * Aaron|home watches the terminal
[07:53:43] nothing gets run, so you'd think they'd get delisted
[07:53:50] yep
[07:54:24] sometimes the same wiki gets hit twice in a row
[07:54:39] if it's still got stuff, fine
[07:54:43] it could be races sometimes but not so often like this
[07:55:21] generally, if you call nextJobDB and run that wiki, there should be jobs run (aside from sometimes getting raced out be another runner)
[07:55:37] right
[07:56:00] but it will only run some of them, there can be more left
[07:56:06] the other possibility was a queue isEmpty() function returning false for now reason
[07:56:10] but you mean twice in a row when there is nothing run?
[07:56:14] I could not see any of that
[07:56:32] apergos: yes
[07:56:41] that's bad all right
[07:58:16] apergos: do you see anything in nextJobDB that might cause that?
[07:58:24] lemme look
[07:59:28] it's not like it never happens, I know it does delist, but something is off about it
[08:02:11] now it's doing refreshLinks
[08:02:32] I guess that works then
[08:02:42] isEmpty looks pretty straightforward
[08:02:46] apergos: is it on all runners now?
[08:02:49] not much of a way for it to go awry
[08:03:07] might be 1 or 2 that have a couple minutes before the next puppet but the vast moajority have it
[08:03:58] * apergos stares at delistDB some
[08:04:49] also not that pop() sets the cache key to true if it finds nothing, which is what must be happening when nothing is run
[08:04:52] *note
[08:05:32] getCacheKey() is properly prefixed
[08:08:19] the way to find out if it is really truly broken is to have it log when it gets into this section
[08:08:22] if ( !$this->checkJob( $type, $db ) ) {
[08:09:02] 'delisting blah' or whatever
[08:09:16] heh, I did that in temp-debug.log
[08:09:23] ah hah
[08:09:23] it had a stream of entries
[08:09:34] even for the case when it could get the lock
[08:09:50] and I was var_dump()'ing the keys in eval.php
[08:10:07] mmm
[08:10:17] the problem is not contention (which never showed in the log)
[08:10:26] nor the delisting never happening
[08:10:32] but clearly something is wonk
[08:10:35] no we know it happens sometimes, like you say
[08:11:31] uugghh
[08:11:45] I have to eat fast (which is impossible for me) and then get going, I have to be somewhere in 15 mins :-(
[08:12:04] are things unbroken enough that I can leave this?
[08:12:15] apergos: it's a little better than before
[08:12:17] I mean if it is running the refreshlinks every so often at least
[08:12:19] I guess it's fine
[08:12:36] you could shorten that interval if you wanted, I dunno
[08:12:46] apergos: as a hack, maybe
[08:13:00] it could work well
[08:13:31] if you do that I'll stick around long enough to push that out
[08:13:41] * apergos looks for food that can be eaten fast
[08:14:08] found: tea biscuits and nutella :-P
[08:14:57] ok
[08:16:34] apergos: what is a good lpmaxdelay value? 1 minute?
[08:17:06] hmm longer I woud think
[08:17:14] no?
[08:17:36] I woulda given like 3 mins
[08:17:49] well hpmaxdelay is 3 min
[08:17:53] oh woops
[08:17:55] ugh
[08:18:17] so it would be 1:3 then
[08:18:43] that time it spends usefully for enotifNotify is actually quite short
[08:18:52] ok
[08:18:58] well let's try it and see how it looks
[08:19:07] guess I'll stick around a bit longer :-D
[08:20:09] New patchset: Aaron Schulz; "Temporary hack to run low-priority jobs more." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46020
[08:20:23] apergos: ^
[08:20:44] I see it
[08:21:12] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46020
[08:21:55] * Aaron|home wanted to play simcity today (really yesterday technically)
[08:22:06] oh, well, next time
[08:22:14] (the closed beta)
[08:22:39] rats
[08:23:21] wiki=nowiki always freaks me out a bit
[08:24:14] 'm a bit baffled about what's being run now
[08:24:17] (mw1014 again)
[08:25:45] do we have a script that will give us a snapshot of the state of the job queue (which dbs have how many of what pending)?
[08:26:42] you could do mwscriptwikiset showJobs.php all.dblist --group
[08:27:18] * apergos maks a note of that
[08:28:23] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:28:32] well anyways it is running both types of jobs frequently for now
[08:30:54] apergos: I think things will be okish for now, thanks!
[08:31:12] thank you
[08:31:48] see you later!
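What the hpmaxdelay/lpmaxdelay discussion above amounts to: a runner should only spend a bounded slice of time on the high-priority job types before being forced to pick up everything else, so types like refreshLinks cannot be starved indefinitely. A much-simplified sketch of that alternation, not the real jobs-loop script; run_hp_jobs and run_other_jobs are hypothetical stand-ins, and the 180/60 second values mirror the 3:1 ratio settled on above:

```bash
#!/bin/bash
# Simplified sketch of the high/low priority alternation discussed above.
# The two helpers are hypothetical placeholders for "run one batch of
# high-priority jobs" / "run one batch of whatever else is queued"; they
# return non-zero when nothing was run.
run_hp_jobs()    { false; }
run_other_jobs() { false; }

hpmaxdelay=180   # max seconds spent on high-priority types (3 min)
lpmaxdelay=60    # guaranteed slice for everything else (1 min)

while true; do
    # Phase 1: high-priority job types, but only until hpmaxdelay is used up.
    started=$(date +%s)
    while (( $(date +%s) - started < hpmaxdelay )); do
        run_hp_jobs || break          # nothing queued: stop early
    done

    # Phase 2: the rest of the queue (refreshLinks, enotifNotify, ...)
    # gets its own bounded slice, so it cannot be starved indefinitely.
    started=$(date +%s)
    while (( $(date +%s) - started < lpmaxdelay )); do
        run_other_jobs || break
    done

    sleep 5                           # idle briefly when both queues are empty
done
```

Queue state across all wikis can be inspected with the command quoted above, `mwscriptwikiset showJobs.php all.dblist --group`, whose per-type output looks like the "enotifNotify: 75 queued; 3258 acquired" line pasted earlier in the log.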
[08:32:44] bye [08:33:38] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:10:05] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Sun Jan 27 09:09:38 UTC 2013 [09:45:22] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [09:45:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [09:46:02] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [09:46:07] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [09:51:03] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 20 seconds [09:51:13] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 20 seconds [09:51:22] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 21 seconds [09:51:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 22 seconds [10:08:52] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:17:15] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [10:19:12] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [10:22:12] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [10:23:16] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [10:23:16] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [10:24:18] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [10:25:12] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [10:36:18] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [10:39:18] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [10:39:18] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [10:42:18] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [11:16:54] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Sun Jan 27 11:16:31 UTC 2013 [12:08:02] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Sun Jan 27 12:07:59 UTC 2013 [12:09:12] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Sun Jan 27 12:09:02 UTC 2013 [12:39:32] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [12:41:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [12:58:33] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [12:59:06] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.011 second response time on port 11000 [13:43:41] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:44:17] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:44:52] New review: Hashar; "PS14: send logs to localhost" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [13:50:10] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:50:59] New review: Hashar; "PS16: 
rebased / fixed conflict" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [14:02:30] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [14:13:39] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [14:16:38] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [14:27:54] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [14:29:00] New review: Hashar; "PS18: get an API backend IP" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [16:25:25] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:26:56] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [16:58:46] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [17:30:19] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [17:30:19] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:30:20] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [17:30:21] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:40:59] could i bug someone for some server logs pls [18:41:43] i would like to analyze api logs, preferably for at least a few days span [18:46:26] yurik: can you give me a bit more detail? [18:47:26] request logs contain private information so we can't just release them :-/. [18:47:50] ori-l: i would like to look at the usage patterns, trying to figure out what features are used and how, which sub-requests are combined into one call, etc [18:48:20] ok, let me first take a look at what kind of log data we have [18:48:28] i have already seen the logs a while back - obviously they shouldn't be released to general public, but i guess i could sign some nondisclore if you really want :) [18:48:53] ori-l: in any way, they wouldn't really contain private info - that would be posted [18:49:41] ori-l: i used that log info before to optimize api design [18:50:27] yurik: re: having had access to the log before -- yes, I completely understand how it's silly for you not to have access to that. I'm aware of who you are :) [18:51:53] as for not containing private information: you'd be surprised. The analytics team wanted to release an anonymized log of search queries and you'd be surprised what sort of things are in there -- sometimes people just paste their clipboard because they think one thing is in it but in fact it is something else [18:52:15] * yurik is riding the fame wave... 
hoping that wave is mostly good clean water, not something smelly :))) [18:52:31] ori-l: understood, won't disclose anything i see [18:52:33] Copy/pasting is not an issue with the API, but I'm just citing it as an example of data you wouldn't expect to contain private data [18:52:55] i mean - won't disclose anything private like that [18:53:03] obviously will publish my general findings [19:05:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [19:05:49] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [19:09:37] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [19:13:02] New review: Hashar; "Ok I managed to get a hit on the backend Apaches!" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [19:17:26] New patchset: Vogone; "Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [19:38:05] New patchset: Vogone; "Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [19:43:59] New review: Alex Monk; "Bug 44411 - shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/46187 [19:51:10] Could ops type folks look into bug 41130 (images not being purged. Most reports from people accessing via europe)? There is a lot of user frustration building about the issue [19:57:16] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.73590942029 (gt 8.0) [20:02:32] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [20:02:42] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [20:03:42] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [20:04:52] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [20:06:34] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [20:11:33] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 12 seconds [20:11:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 15 seconds [20:12:51] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:13:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:14:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:16:03] New patchset: Hashar; "link android-commons nightly apps on CI portal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46188 [20:18:25] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [20:20:21] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [20:22:09] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.008 second response time on port 11000 [20:23:00] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [20:23:21] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [20:24:24] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [20:24:25] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [20:25:27] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [20:26:21] PROBLEM - Puppet 
freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [20:37:27] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [20:40:27] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [20:40:28] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [20:43:27] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [20:46:47] New patchset: Vogone; "(bug 44411) Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [20:53:39] * Jasper_Deng wonders if anyone wants to merge the change for https://bugzilla.wikimedia.org/show_bug.cgi?id=44395 [21:57:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:57:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:11:44] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0 [22:25:14] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 16.0005276259 (gt 8.0) [22:35:25] New review: Hoo man; "Change is trivial and consensus is given, please merge" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/46187 [22:43:25] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0 [22:52:32] New patchset: Diederik; "Enable sampling on Thank_You_Main and country filter on Emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46189 [22:53:59] any ops around? [22:54:49] LeslieCarr: ping, in case you're on RT duty [22:55:54] anyways, if anyone from ops sees this: https://gerrit.wikimedia.org/r/46189 (change to filters in operations/puppet) should resolve packet loss issues on emery. fundraising forgot to sample. [22:56:56] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 10.6844683942 (gt 8.0) [22:58:59] ^^ case in point :-/ [22:59:32] :D [23:04:48] hey [23:05:38] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:05:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [23:06:08] New review: Faidon; "Supposed to solve a 10-15% packet loss on emery. I'll take it." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/46189 [23:06:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46189 [23:06:56] hey paravoid [23:07:00] ty! [23:07:23] running puppet on emery [23:09:26] done [23:09:47] awesome, let's hope that nagios tells us that the problem has been resolved [23:09:59] paravoid: woops, I tabbed out of IRC and didn't see that you replied [23:10:04] I just e-mailed operations [23:10:17] Would you like to reply or should I? [23:11:01] !log enabled sampling on Thank_You_Main and a country filter on Emery [23:11:03] done [23:11:14] Logged the message, Master [23:12:11] paravoid: those processes are still running, you might have to restart udp2log [23:13:07] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0 [23:13:15] i think it's okay: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&v=0.0&m=packet_loss_average&r=hour&z=default&jr=&js=&st=1359324678&vl=%25&z=large [23:13:36] OK, cool [23:13:50] restart what? 
[23:14:24] i think we are set [23:14:30] pids 21802 and 21797 are still the same as before, but maybe they don't need to be restarted if it's simply udplog that pipes less data into them [23:14:46] I'm not sure what the forking / piping behavior is [23:15:32] they're each still taking up 100% of a core but it seems like it's periodically dropping to values in the 90%s so maybe it went from not keeping up to barely keeping up [23:15:41] while not forking { fork } [23:16:10] * bork [23:18:15] Is JeLuF still around at all? [23:24:34] !log reedy synchronized php-1.21wmf8/extensions/CentralAuth/ [23:24:45] Logged the message, Master [23:50:05] !log reedy synchronized php-1.21wmf7/extensions/CentralAuth/ [23:50:17] Logged the message, Master
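The emery packet loss at the end of the day was resolved by enabling sampling on the Thank_You_Main filter and adding a country filter (gerrit change 46189, merged above), so the consumer only has to handle a fraction of the incoming lines. A toy 1-in-N sampler illustrating the idea; the factor of 100 is an arbitrary example, and this is not the actual udp2log filter that was changed:

```bash
#!/bin/bash
# Toy 1-in-N sampler for a log stream on stdin. Keeping only every Nth
# line is what enabling sampling buys: the downstream filter has far less
# to process, so it stops falling behind and dropping packets. N=100 is
# an arbitrary example value, not the factor used in the real change.
N=${1:-100}
exec awk -v n="$N" 'NR % n == 0'
```

Illustrative usage would be something like `cat thank_you.log | ./sample.sh 100 > thank_you_sample.log`; the real fix lived in the puppet-managed udp2log filter configuration on emery, per the log message and gerrit link above.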