[01:24:43] PROBLEM - puppet last run on mw2046 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:29:33] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[01:53:33] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:05:44] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: puppet fail
[02:07:17] Krenair, Ori turned nutcracker back on a few days ago. We both were sure that it would be harmless but now I suspect that that’s the cause of the session losses.
[02:07:31] I don’t have time to pay attention to it, though, unless it hits crisis levels.
[02:11:04] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail
[02:19:47] operations, Services, Discovery-Maps-Sprint: Kartotherian does not start in producton - https://phabricator.wikimedia.org/T115074#1724172 (Yurik) Open>Resolved a:Yurik I had to refactor the code to avoid 10+ different singleton instances of mapnik, and now it starts ok.
[02:23:49] (PS1) Ori.livneh: Drop Grafana 1.x; promote Grafana 2.x [puppet] - https://gerrit.wikimedia.org/r/246136
[02:32:44] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[02:34:24] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[02:34:53] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[02:35:11] (PS2) Ori.livneh: Drop Grafana 1.x; promote Grafana 2.x [puppet] - https://gerrit.wikimedia.org/r/246136
[02:36:00] (CR) Ori.livneh: [C: 2] Drop Grafana 1.x; promote Grafana 2.x [puppet] - https://gerrit.wikimedia.org/r/246136 (owner: Ori.livneh)
[02:36:45] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[02:37:08] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 07m 10s)
[02:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:40:43] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-14 02:40:43+00:00
[02:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:17] operations, Graphite, Patch-For-Review: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1724187 (ori) Open>Resolved
[02:48:10] operations, Graphite, Patch-For-Review: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1425863 (ori) There are now two vhosts: * [[ https://grafana.wikimedia.org/ | grafana.wikimedia.org ]] provides a read-only, publicly-accessible view of our dashboards. * [[ https://grafana-admin...
[02:55:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[03:00:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[03:05:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[03:07:54] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 10m 37s)
[03:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:22] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-14 03:14:22+00:00
[03:14:23] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 14 03:14:22 UTC 2015 (duration 14m 21s)
[03:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:15:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[03:20:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[03:29:15] (PS2) EBernhardson: Define cirrussearch shards/replicas per datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/244221
[03:36:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[03:37:44] (Abandoned) EBernhardson: Update CirrusSearch config to define clusters [mediawiki-config] - https://gerrit.wikimedia.org/r/237282 (https://phabricator.wikimedia.org/T109734) (owner: EBernhardson)
[03:45:16] (PS1) EBernhardson: Revert "cirrus tests: skip if full MediaWiki install is not availale" [mediawiki-config] - https://gerrit.wikimedia.org/r/246150
[03:48:18] (PS2) EBernhardson: Revert "cirrus tests: skip if full MediaWiki install is not availale" [mediawiki-config] - https://gerrit.wikimedia.org/r/246150
[03:49:10] (Abandoned) EBernhardson: Define cirrussearch shards/replicas per datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/244221 (owner: EBernhardson)
[04:00:16] (PS3) EBernhardson: Revert "cirrus tests: skip if full MediaWiki install is not availale" [mediawiki-config] - https://gerrit.wikimedia.org/r/246150
[04:01:15] (CR) Ori.livneh: [C: 2] Revert "cirrus tests: skip if full MediaWiki install is not availale" [mediawiki-config] - https://gerrit.wikimedia.org/r/246150 (owner: EBernhardson)
[04:01:22] (Merged) jenkins-bot: Revert "cirrus tests: skip if full MediaWiki install is not availale" [mediawiki-config] - https://gerrit.wikimedia.org/r/246150 (owner: EBernhardson)
[04:02:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[04:39:19] (PS1) Ori.livneh: grafana: enable gzip [puppet] - https://gerrit.wikimedia.org/r/246153
[04:42:25] (CR) Ori.livneh: [C: 2] grafana: enable gzip [puppet] - https://gerrit.wikimedia.org/r/246153 (owner: Ori.livneh)
[05:18:56] (Abandoned) KartikMistry: Fix nbwiki to nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/244736 (owner: Amire80)
[05:23:45] (CR) EBernhardson: "didn't hear from anyone either way, so i've gone ahead and implemented the change so that all ElasticaWrite job failures are handled inter" [mediawiki-config] - https://gerrit.wikimedia.org/r/243744 (owner: EBernhardson)
[05:24:48] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:51:48] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:02:27] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1724286 (Nemo_bis) Well, gerrit and Phabricator emails are certainly very bad. Multiple bugs have been reported in the last few years with actionable suggestions (some of whi...
[06:30:19] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:48] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:08] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:38] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail
[06:32:59] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:09] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:18] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:50:38] PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[06:51:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [10.0]
[06:52:48] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [10.0]
[06:53:28] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [10.0]
[06:54:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [10.0]
[06:54:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [10.0]
[06:55:59] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:18] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:58] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:18] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:19] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:58] RECOVERY - Kafka Broker Server on kafka1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties
[06:59:37] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0]
[07:01:58] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [10.0]
[07:02:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [10.0]
[07:04:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [10.0]
[07:04:59] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [10.0]
[07:08:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 1.00% above the threshold [1.0]
[07:10:27] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 1.00% above the threshold [1.0]
[07:10:59] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 1.00% above the threshold [1.0]
[07:11:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 1.00% above the threshold [1.0]
[07:11:47] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 1.00% above the threshold [1.0]
[07:43:18] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[07:44:57] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[08:20:54] (PS1) Dereckson: Add *.unesco.org to server-side upload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/246167 (https://phabricator.wikimedia.org/T115338)
[08:25:29] (PS2) Dereckson: Add *.unesco.org to server-side upload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/246167 (https://phabricator.wikimedia.org/T115338)
[08:30:14] (PS1) Dereckson: Enable WikidataPageBanner on fr.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/246169 (https://phabricator.wikimedia.org/T115023)
[08:55:30] (PS1) Dereckson: Configure default Echo subscriptions user options on he.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982)
[09:00:11] (CR) Dereckson: "See https://phabricator.wikimedia.org/P2193 for a compact file which illustrates how that works." [mediawiki-config] - https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: Dereckson)
[10:12:43] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1724655 (Aklapper)
[11:48:48] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[11:50:27] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[11:53:38] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:56:29] operations, Labs, wikitech.wikimedia.org: Nutcracker having issues on wikitech - https://phabricator.wikimedia.org/T115457#1724804 (Krenair) NEW
[12:07:11] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/VisualEditor/modules/ve-mw/ui/inspectors/ve.ui.MWLinkAnnotationInspector.js: https://gerrit.wikimedia.org/r/#/c/246205/ (duration: 01m 13s)
[12:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:08:20] Why does it take so long to connect to mira?
[12:09:34] debug1: Connecting to mira [2620:0:860:102:10:192:16:132] port 22.
[12:09:35] debug1: connect to address 2620:0:860:102:10:192:16:132 port 22: Connection timed out
[12:09:35] debug1: Connecting to mira [10.192.16.132] port 22.
[12:09:35] debug1: Connection established.
[12:09:35] Ugh.
[12:11:21] I bet this is the firewall resolving trusted hosts only to IPv4 addresses.
[12:13:38] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1724836 (Elitre) >>! In T115416#1723815, @greg wrote: > Btw, I just got this when in gmail's web UI to mark them not as spam: > {F2717835} > > Anyone else? Didn't get that...
[12:14:47] And yet it works in reverse. Interesting.
[12:15:07] Oh, right. Because no firewall for tin. Okay.
[12:18:52] (PS1) Alex Monk: Fix scap firewall rules to allow connecting to mira over IPv6 [puppet] - https://gerrit.wikimedia.org/r/246206
[12:20:28] (CR) Faidon Liambotis: [C: 2] "Thanks. This is really a ferm bug that I need to fix at some point..." [puppet] - https://gerrit.wikimedia.org/r/246206 (owner: Alex Monk)
[12:20:37] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[12:22:20] paravoid, er, so I wrote that
[12:22:29] wrote what?
[12:22:30] And then realised that other puppet manifests use something else to allow the same thing
[12:22:51] We have deployment_hosts in special_hosts
[12:22:58] ferm::rule { 'deployment-ssh':
[12:22:58] ensure => present,
[12:22:58] rule => 'proto tcp dport ssh saddr $DEPLOYMENT_HOSTS ACCEPT;',
[12:22:58] }
[12:23:02] is what the mediawiki role uses
[12:23:29] yeah I guess that's equivalent
[12:23:33] should I change it to that, or...?
[12:23:38] yeah sure, that works
[12:28:06] (PS1) Alex Monk: Use the same ferm rule for deployment hosts as we do for mediawiki etc. [puppet] - https://gerrit.wikimedia.org/r/246207
[12:28:47] I have to go now, I'll merge it later
[12:28:49] sorry
[12:28:53] ok, no problem
[12:32:03] !log krenair@tin Synchronized README: testing https://gerrit.wikimedia.org/r/246206 (duration: 00m 17s)
[12:32:07] much better
[12:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:55:40] !log ERROR: Could not connect to SMTP host: polonium.wikimedia.org, port: 25 (from labs instances)
[12:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:37] hashar, yeah polonium was replaced I think?
[12:59:54] mx100[12]?
[13:00:00] oh, just mw1001
[13:00:03] mx1001*
[13:00:09] hieradata/labs/sentry/common.yaml:sentry::smtp_host: 'polonium.wikimedia.org'
[13:00:09] hieradata/role/common/sentry.yaml:sentry::smtp_host: 'polonium.wikimedia.org'
[13:00:13] still defined at some place
[13:00:19] * Krenair facepalms
[13:01:00] and in Jenkins that is hardcoded in the definition (Jenkins is not puppetized), but that part we don't mind suffering disruption and having to fix it
[13:01:35] ah it was mx2001 in codfw
[13:01:38] so yeah apparently $mail_smarthost 'labs' => [ 'mx1001.wikimedia.org', 'mx2001.wikimedia.org' ],
[13:01:54] # spare
[13:01:54] node 'polonium.wikimedia.org' {
[13:03:07] maybe I should get jenkins to send its mail to localhost and get a smtp relay on gallium
[13:03:16] to my knowledge polonium is mx1001 and lead has become mx2001
[13:03:30] which is nice, because one is eqiad and one is codfw now :)
[13:04:46] and we also have wiki-mail.wikimedia.org pointing to mx1001 :D
[13:05:06] should probably point stuff at that...
[13:05:39] unless it is legacy
[13:07:51] if people don't use $mail_smarthost they're on their own as far as I care
[13:10:43] operations, Graphite, Patch-For-Review: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1724974 (BBlack) Thanks! log-scale y-axis now works :)
[13:11:27] I don't understand why sentry should have its own hiera config for the smarthost
[13:11:44] and as for jenkins, I don't understand why it needs to use SMTP and can't use sendmail()
[13:11:57] gallium does run an exim, as all hosts do
[13:12:02] !log restarted diamond, puppet didn't seem to after it removed the TcpConnStates from most hosts
[13:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:13:55] operations, Labs, Labs-Infrastructure, hardware-requests, labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1724991 (Andrew) This will need some misc database support as well -- adding Jaime to the ticket. We could potentially have the dbs all run on the c...
[13:15:12] (and yes, wiki-mail is legacy -- it actually never was supposed for what people used it for)
[13:16:41] operations, Labs, Labs-Infrastructure, hardware-requests, labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1725004 (coren) Have you coordinated with @papaul about the currently unknown-function Cisco servers that used to be Labs and that currently live in D...
[13:19:55] (PS1) Ottomata: Turn on varnish reqstats diamond collector on eqiad mobile caches [puppet] - https://gerrit.wikimedia.org/r/246215 (https://phabricator.wikimedia.org/T83580)
[13:22:45] (CR) Ottomata: [C: 2] Turn on varnish reqstats diamond collector on eqiad mobile caches [puppet] - https://gerrit.wikimedia.org/r/246215 (https://phabricator.wikimedia.org/T83580) (owner: Ottomata)
[13:23:10] (PS4) Ori.livneh: Add dblist to many paths [puppet] - https://gerrit.wikimedia.org/r/244743 (owner: Reedy)
[13:23:31] (CR) Ori.livneh: [C: 2 V: 2] Add dblist to many paths [puppet] - https://gerrit.wikimedia.org/r/244743 (owner: Reedy)
[13:30:08] (PS1) Ori.livneh: decom grafana-test and tessera vhosts [dns] - https://gerrit.wikimedia.org/r/246218
[13:30:31] !log kafka-preferred-replica election after kafka1012
[13:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:32:45] ori: have you upgraded grafana already? :D
[13:33:01] (CR) Luke081515: [C: 1] Configure default Echo subscriptions user options on he.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: Dereckson)
[13:35:26] (PS1) Andrew Bogott: Fixed the CST paging hours. Previously AM/PM were reversed. [puppet] - https://gerrit.wikimedia.org/r/246219
[13:37:23] (PS2) Andrew Bogott: Fixed the CST paging hours. Previously AM/PM were reversed. [puppet] - https://gerrit.wikimedia.org/r/246219
[13:38:22] (CR) Andrew Bogott: [C: 2] Fixed the CST paging hours. Previously AM/PM were reversed. [puppet] - https://gerrit.wikimedia.org/r/246219 (owner: Andrew Bogott)
[13:38:56] (PS1) Muehlenhoff: Move base::debdeploy into the base class [puppet] - https://gerrit.wikimedia.org/r/246220
[13:44:10] (PS1) Muehlenhoff: Move the base::firewall include into the impala role [puppet] - https://gerrit.wikimedia.org/r/246221
[13:50:27] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:51:54] (PS1) Muehlenhoff: tor: Move the ferm rules into the role [puppet] - https://gerrit.wikimedia.org/r/246223
[13:52:07] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.002 second response time
[13:56:39] (PS1) Muehlenhoff: wdqs: Move the ferm rules into the role [puppet] - https://gerrit.wikimedia.org/r/246224
[13:58:34] operations, ops-codfw: power off Codfw-Cisco Servers - https://phabricator.wikimedia.org/T115372#1725144 (RobH) Also please boot each and ensure the disks were wiped. If not, please wipe the disks. (Wipe in software, not the degausser.)
[14:02:24] operations, Monitoring: non sms alternatives - https://phabricator.wikimedia.org/T114651#1702155 (RobH) I wouldn't want to rely on data instead of SMS alerts. Some parts of apartments have horrible data, but SMS works. Thus, I can easily be in a situation where my phone isn't on data, but if I get an SMS...
[14:02:25] operations, Monitoring: non sms alternatives - https://phabricator.wikimedia.org/T114651#1725159 (RobH)
[14:02:28] operations, Monitoring: non sms alternatives - https://phabricator.wikimedia.org/T114651#1702155 (RobH)
[14:06:11] operations, Monitoring: non sms alternatives - https://phabricator.wikimedia.org/T114651#1725164 (coren) The reason why I like data-based alerting (in addition to/with SMS) is that it tends to work better when you have intermittent coverage (because of the way it works, with positive ack, means that you'll...
[14:06:30] operations, Patch-For-Review, Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#1725165 (Dzahn) As @akosiaris pointed out yesterday the aggregated graphs don't seem to work currently. That would mean we don't have this feature anyways and the others have been replaced by LibreNMS.
[14:12:59] hashar: yep
[14:13:26] ori: that is great, would you mind restoring the previous landing page? Not sure where it went / how it was named though
[14:13:41] I had a use case for Grafana 2.x so it is good to see it being upgraded
[14:14:41] ori: there is some work going on to add ElasticSearch as a backend. So in theory we could create boards for our logstash data
[14:15:38] ah http://docs.grafana.org/v2.1/reference/dashlist/
[14:15:40] !log mw1157 - deleted puppet lock file, fix puppet run. ("already running" but didnt since 18h)
[14:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:15:57] operations, Graphite, Patch-For-Review: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1725170 (hashar) Thank you a ton @ori We should get http://docs.grafana.org/v2.1/reference/dashlist/ among others :-}
[14:17:45] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:18:18] (PS1) Ori.livneh: decom tessera [puppet] - https://gerrit.wikimedia.org/r/246227
[14:20:13] (CR) Ori.livneh: [C: 2] decom tessera [puppet] - https://gerrit.wikimedia.org/r/246227 (owner: Ori.livneh)
[14:20:22] (PS1) Ottomata: Enable diamond reqstats collector for all mobile varnishes [puppet] - https://gerrit.wikimedia.org/r/246229 (https://phabricator.wikimedia.org/T83580)
[14:21:10] (PS2) Ottomata: Enable diamond reqstats collector for all mobile varnishes [puppet] - https://gerrit.wikimedia.org/r/246229 (https://phabricator.wikimedia.org/T83580)
[14:21:40] (PS1) Ori.livneh: misc varnish: drop tessera.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/246230
[14:22:26] (PS1) Alex Monk: ldaplist: Add support for projects/projectroles [puppet] - https://gerrit.wikimedia.org/r/246231
[14:24:45] !log cleaned up tessera leftovers on graphite1001
[14:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:25:03] (CR) Ottomata: [C: 2] Enable diamond reqstats collector for all mobile varnishes [puppet] - https://gerrit.wikimedia.org/r/246229 (https://phabricator.wikimedia.org/T83580) (owner: Ottomata)
[14:25:58] (PS2) Ori.livneh: misc varnish: drop tessera.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/246230
[14:26:41] (CR) Ori.livneh: [C: 2] misc varnish: drop tessera.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/246230 (owner: Ori.livneh)
[14:27:05] operations, Analytics-Kanban, netops, Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1725187 (Dzahn)
[14:27:12] (CR) Ori.livneh: [C: 2] decom grafana-test and tessera vhosts [dns] - https://gerrit.wikimedia.org/r/246218 (owner: Ori.livneh)
[14:28:00] wikibugs: talk
[14:28:19] wikibugs: ping
[14:28:30] wikibugs: yo
[14:28:55] wikibugs: sup
[14:31:34] works on other channel
[14:31:36] < wikibugs> Wikibugs: Wikibugs should not say ", and 0 others" - https://phabricator.wikimedia.org/T90324#1725196 (Dzahn) bla bla bla (testing wikibugs)
[14:40:52] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:42:05] mutante|1way: what's the alert text?
[14:42:11] (icinga is being slow now)
[14:42:12] (PS1) Ori.livneh: Set up regular backups for the Grafana database [puppet] - https://gerrit.wikimedia.org/r/246236
[14:42:16] akosiaris: ^
[14:42:38] yuvipanda: CRITICAL: Puppet has 1 failures
[14:43:05] mutante|1way: no, for the tools-shadow bug you re-opened?
[14:43:43] yuvipanda: CRITICAL: master class instances not spread out enough
[14:43:55] hmm
[14:43:57] not sure
[14:44:00] why that's the case
[14:44:02] I'll check later
[14:44:04] i just opened that one because the icinga still had a comment
[14:44:07] (CR) Alexandros Kosiaris: [C: 2] Set up regular backups for the Grafana database [puppet] - https://gerrit.wikimedia.org/r/246236 (owner: Ori.livneh)
[14:44:07] yeah
[14:44:07] that linked to this ticket
[14:44:10] ok, cool
[14:44:14] thanks!
[14:44:15] that was fixed a few weeks ago
[14:44:48] it seems different for labcontrol1001 vs. 1002
[14:44:59] the time since it became crit i mean
[14:47:19] (PS1) Alexandros Kosiaris: ldap.conf: Remove openldap unused parameters [puppet] - https://gerrit.wikimedia.org/r/246242
[14:47:28] (CR) Dzahn: [C: 2] Format Tmax in slow queries page as a number [software/tendril] - https://gerrit.wikimedia.org/r/244964 (owner: Alex Monk)
[14:47:47] (CR) Dzahn: [V: 2] Format Tmax in slow queries page as a number [software/tendril] - https://gerrit.wikimedia.org/r/244964 (owner: Alex Monk)
[14:50:07] (PS3) Andrew Bogott: Openstack: Don't notify keystone when the keystone policy changes [puppet] - https://gerrit.wikimedia.org/r/244349
[14:51:06] (CR) Andrew Bogott: [C: 2] Openstack: Don't notify keystone when the keystone policy changes [puppet] - https://gerrit.wikimedia.org/r/244349 (owner: Andrew Bogott)
[14:55:02] (PS2) Dzahn: Include base::firewall into the planet role [puppet] - https://gerrit.wikimedia.org/r/245970 (owner: Muehlenhoff)
[14:57:00] legoktm: could you fix wikibugs. it stopped talking here but not in other channels
[14:57:17] hello
[14:57:20] wikibugs: !
[14:57:42] operations, Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1725253 (Legoktm)
[14:57:47] eh :)
[14:57:53] seems to work fine?
[14:57:57] now it does :p
[14:58:18] fixed by asking :) thx
[14:58:25] yw :)
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151014T1500).
[15:00:04] MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:42] (CR) Dzahn: [C: 2] Include base::firewall into the planet role [puppet] - https://gerrit.wikimedia.org/r/245970 (owner: Muehlenhoff)
[15:01:07] hi.
[15:02:42] MatmaRex: hiya, I'll SWAT today.
[15:04:49] (PS2) Dzahn: tor: Move the ferm rules into the role [puppet] - https://gerrit.wikimedia.org/r/246223 (owner: Muehlenhoff)
[15:05:00] !log Statsv and eventlogging-navtiming seems to have gone down 7 hours ago
[15:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:08:00] (PS1) Muehlenhoff: Move dnsrecursor to the role keyword [puppet] - https://gerrit.wikimedia.org/r/246244
[15:09:45] (CR) Giuseppe Lavagetto: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/246244 (owner: Muehlenhoff)
[15:11:12] !log thcipriani@tin Synchronized php-1.27.0-wmf.3/extensions/UploadWizard/UploadWizard.config.php: SWAT: Remove default category for UploadWizard files [[gerrit:246226]] (duration: 00m 18s)
[15:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:11:35] MatmaRex: ^ Check if possible please.
[15:11:58] thcipriani: it only affects commons and it's still on wmf.2
[15:12:11] (CR) Paladox: "Not really a setting can be called what ever they like as long as they follow the mediawiki convention on how they are set out. The settin" [mediawiki-config] - https://gerrit.wikimedia.org/r/244739 (owner: Paladox)
[15:12:36] MatmaRex: kk, going ahead with 2
[15:14:58] !log restarted navtiming and statsv services on hafnium
[15:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:19:55] !log thcipriani@tin Synchronized php-1.27.0-wmf.2/extensions/UploadWizard/UploadWizard.config.php: SWAT: Remove default category for UploadWizard files [[gerrit:246225]] (duration: 00m 17s)
[15:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:20:11] ^ MatmaRex check please
[15:20:43] thcipriani: thanks, looking
[15:24:16] thcipriani: thanks, it works as expected
[15:24:26] MatmaRex: nice. Thanks for checking!
[15:24:57] dcausse: ping for SWAT, looks like the bot missed your patch.
[15:25:05] thcipriani: hi
[15:26:45] dcausse: howdy, so is this just a maintenance script update? Don't need me to run anything, right?
[15:27:12] thcipriani: yes no need to run anything, it's for the new codfw elastic cluster
[15:27:24] dcausse: cool, just making sure :)
[15:31:48] !log thcipriani@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: SWAT: Split connection to source and target [[gerrit:246243]] (duration: 00m 18s)
[15:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:32:05] ^ dcausse sync'd! check if possible
[15:32:36] thcipriani: looks good, thanks!
[15:32:52] dcausse: cool, thank you!
[15:45:49] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1725452 (bd808) New fun today, one message from {T115440} marked as a possible phishing scam. {F2720728}
[15:49:37] operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1725479 (RobH)
[15:49:38] operations, Traffic, HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#1725478 (RobH)
[15:50:49] operations, Traffic, HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#1637061 (RobH) We've recently changed the policy on ordering ssl certificates to put in the ssl_renewals alias for all ssl cert info. Additionally, we should see if we can change the backend conta...
[15:50:53] (PS1) Ori.livneh: grafana: provide the ability to provision dashboards via Puppet [puppet] - https://gerrit.wikimedia.org/r/246252
[15:51:06] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1725485 (greg) FWIW, I also had a ton of Shinken alerts in my spam folder as well.
[15:52:10] (03PS2) 10Ori.livneh: grafana: provide the ability to provision dashboards via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/246252 [15:52:52] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1725487 (10mmodell) Thanks. The graphs... [15:52:59] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1725489 (10mmodell) 5Open>3Resolved [15:53:04] (03CR) 10Ori.livneh: [C: 032] grafana: provide the ability to provision dashboards via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/246252 (owner: 10Ori.livneh) [15:53:56] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1582035 (10mmodell) [16:00:12] 6operations: Create roles for test systems and spares - https://phabricator.wikimedia.org/T115489#1725516 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [16:14:24] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1725570 (10Andrew) 3NEW a:3mark [16:15:59] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1725581 (10ori) Grafana is now set up to load JSON dashboard definition files from a local directory which the Grafana Puppet module manages. There is a `grafana::dashboard` resource that mak... 
[16:17:11] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1725583 (10Andrew) 3NEW a:3mark [16:21:02] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1725608 (10coren) Given that we are looking to do vlan config with Neutron, I think it'd be wise to allocate a very large subnet fo... [16:25:57] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1725615 (10Andrew) [16:26:03] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1725616 (10RobH) Ok, allocations for this: * WMF3763 (previously named ssl2002) - 8GB System * WMF3810 (previously named ssl2003) - 8GB System * WMF58... [16:32:42] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1725645 (10Andrew) Please cable up two nics for: wmf5850 (labtestvirt2001) WMF5835 (labtestnet2001) The second nic will be on the instance vlan. The... [16:43:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [16:44:11] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:41] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:00:04] moritzm: Dear anthropoid, the time has come. Please deploy Operations (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151014T1700). 
[17:09:50] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:19:11] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:30] aha, it's not just me [17:19:40] akosiaris: ^^^ [17:31:11] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.002 second response time [17:32:24] 6operations, 7HHVM, 5Patch-For-Review: Package and deploy HHVM 3.6.5+dfsg1-1+wm7 - https://phabricator.wikimedia.org/T112640#1725840 (10bd808) [17:34:32] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1725853 (10Smalyshev) [17:42:05] (03CR) 10Eranroz: Configure default Echo subscriptions user options on he.wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [17:56:23] (03PS2) 10Chad: Load some more extensions directly through wfLoadExtension() (B-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232966 (owner: 10Legoktm) [17:56:35] legoktm: Any objections to ^? [17:57:01] ostriches: nope! :) [17:57:24] (03CR) 10Chad: [C: 032] Load some more extensions directly through wfLoadExtension() (B-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232966 (owner: 10Legoktm) [17:57:31] (03Merged) 10jenkins-bot: Load some more extensions directly through wfLoadExtension() (B-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232966 (owner: 10Legoktm) [17:57:39] legoktm: CX is spamming the warning log about using the deprecated extension loading format. [17:57:40] :) [17:57:50] legoktm: Did I do https://gerrit.wikimedia.org/r/#/c/246276/1 right? 
[17:58:21] RoanKattouw: yes [17:58:49] !log demon@tin Synchronized wmf-config/extension-list: (no message) (duration: 00m 17s) [17:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:57] RoanKattouw: if you look at https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/6019/console, it shows you that Flow will no longer run mwext-qunit [17:59:06] (03PS1) 10Ori.livneh: grafana: override the default home dashboard with something custom [puppet] - 10https://gerrit.wikimedia.org/r/246278 [17:59:13] !log demon@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 17s) [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:25] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana: override the default home dashboard with something custom [puppet] - 10https://gerrit.wikimedia.org/r/246278 (owner: 10Ori.livneh) [17:59:49] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1725915 (10Niedzielski) I filed a similar ticket[0] back in August for Phabricator and mailing list messages. The issue actually seemed to have resolved itself for the past mon... [18:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151014T1800). [18:00:20] jouncebot: ok fine [18:00:47] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246279 [18:01:10] A database query error has occurred. This may indicate a bug in the software. 
<-- at dewiki [18:01:29] I can't see any page [18:01:47] (03PS1) 10Chad: Use $wgFlaggedRevsTags instead of $wgFlaggedRevTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246281 [18:01:47] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1725921 (10RobH) @Andrew: Chatting with @Mark @ lunch, we need to clear the allocation of the 32GB misc machine with him. (It was a passing comment, s... [18:01:57] "Error: 1066 Not unique table/alias: 'page_props' (10.64.48.25)" [18:01:58] legoktm: Awesome. Let's see if I have the rights to deploy that [18:01:59] ehm. i just got 0 content in the main page.... [18:02:04] :( [18:02:27] twentyafterfour: revert [18:03:04] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246279 (owner: 1020after4) [18:03:07] RoanKattouw: I just did :) [18:03:11] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246279 (owner: 1020after4) [18:03:20] 303 Compilation failed: two named subpatterns have the same name at offset 263 in /srv/mediawiki/php-1.27.0-wmf.2/includes/MagicWord.php on line 960 [18:03:29] 6operations: Can't see any page, special:RandomPage gives databse error - https://phabricator.wikimedia.org/T115505#1725929 (10Luke081515) 3NEW [18:03:31] Yep. [18:04:14] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1725936 (10Krenair) >>! In T115416#1725915, @Niedzielski wrote: > I filed a similar ticket[0] back in August for Phabricator and mailing list messages. The issue actually seeme... [18:04:16] 6operations: Can't see any page, special:RandomPage gives databse error - https://phabricator.wikimedia.org/T115505#1725938 (10greg) {F2721219} [18:04:31] dewiki is broken folks. anybody working on figuring out why? 
[18:04:33] Function: RandomPage::selectRandomPageFromDB [18:04:33] Error: 1066 Not unique table/alias: 'page_props' (10.64.48.27) [18:04:50] greg-g postet a screenshot of commons, same here [18:04:51] revert the deploy [18:04:52] https://de.wikipedia.org/wiki/Main_Page is acked up too [18:04:55] now [18:04:57] *posted [18:05:09] group1 wiki deploy, I'd bet. [18:05:17] twentyafterfour: revert [18:05:18] I wonder, wikipedias are group2 [18:05:19] (03PS1) 10Ori.livneh: Revert "group1 wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246283 [18:05:23] no, the errors are on wmf2 [18:05:39] what was the issue? [18:05:49] you can't see any page [18:05:50] I didn't deploy it yet [18:05:54] (03PS1) 10Ori.livneh: Follow-up for I5667f712: make the top panel transparent [puppet] - 10https://gerrit.wikimedia.org/r/246284 [18:05:58] maybe https://gerrit.wikimedia.org/r/#/c/232966/ ? [18:05:58] I don't think it's just dewiki [18:06:02] en.wiki too [18:06:06] (03CR) 10Greg Grossmeier: "Krenair: Please remove your -2 on this patch. Chris S and I agree this is OK to do. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [18:06:06] en.wiki also [18:06:08] (03PS1) 10Legoktm: Revert "Load some more extensions directly through wfLoadExtension() (B-D)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246285 [18:06:11] it's not related to any change I made [18:06:13] mw and meta as well [18:06:16] (03CR) 10Legoktm: [C: 032] Revert "Load some more extensions directly through wfLoadExtension() (B-D)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246285 (owner: 10Legoktm) [18:06:20] what the [18:06:22] (03Merged) 10jenkins-bot: Revert "Load some more extensions directly through wfLoadExtension() (B-D)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246285 (owner: 10Legoktm) [18:06:24] legoktm: Already deploying. 
[18:06:33] !log demon@tin Synchronized wmf-config: (no message) (duration: 00m 19s) [18:06:37] yeah [18:06:37] legoktm: [18:06:37] is right [18:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:41] (03CR) 10Alex Monk: "I don't agree that this is OK to do with the extension in it's current state." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [18:06:47] ok [18:06:53] legoktm: thanks [18:06:54] reverted? [18:06:58] (03CR) 10Ori.livneh: [C: 032] Follow-up for I5667f712: make the top panel transparent [puppet] - 10https://gerrit.wikimedia.org/r/246284 (owner: 10Ori.livneh) [18:07:09] <_joe_> are we ok? [18:07:11] works now! Thanks [18:07:17] https://de.wikipedia.org/wiki/Main_Page back after ?action=purge [18:07:17] yes [18:07:19] _joe_: Ya [18:07:21] PROBLEM - MariaDB Slave Lag: s1 on db1051 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 381 [18:07:25] ok sorry about that [18:07:25] legoktm: The heck? [18:07:29] it should have been a no-op [18:07:36] main page on commons is still empty for me [18:07:37] apparently not [18:07:37] Indeed. [18:07:47] <_joe_> what the fuck guys [18:07:52] greg-g: I purged it [18:07:59] back [18:08:02] <_joe_> it's the second release done during this week [18:08:05] _joe_: we just missed you, wanted you back [18:08:07] <_joe_> that breaks the site [18:08:10] well [18:08:13] This wasn't a release. [18:08:14] _joe_: read first, it wasn't the deploy [18:08:15] what about all the other pages? [18:08:19] <_joe_> so much for "no breaking changes" [18:08:23] <_joe_> whatever. [18:08:28] * twentyafterfour didn't break it [18:08:31] <_joe_> was it deployed? [18:08:31] :P [18:08:34] presumably there are plenty of others [18:08:36] that need a purge [18:08:44] _joe_: It wasn't supposed to be a breaking change, it was supposed to be a no-op. [18:08:47] hey [18:08:54] ? 
[18:08:55] what's the TL;DR of what's going on? [18:08:55] there is an unknown number of pages that have been blanked, no? [18:09:02] those 200s were cached by varnish. Not sure there is any way to purge anything added to cache in the last N minutes [18:09:03] the answer to a deploy problem is not purge the caches :P [18:09:10] this was the change that broke it, right? https://gerrit.wikimedia.org/r/#/c/232966/ [18:09:13] paravoid: change to how extensions are loaded, deployed by ostriches, reverted by legoktm [18:09:14] yes, we need to purge [18:09:15] paravoid: config change caused blank pages. [18:09:16] um, there's also high s1 slave lag [18:09:22] otherwise, there is no content visible [18:09:24] paravoid: site is now fine, but many pages were cached as blank [18:09:30] is there some way to identify the affected pages? [18:09:36] we got paged for mariadb lag [18:10:02] 6operations: Can't see any page, special:RandomPage gives databse error - https://phabricator.wikimedia.org/T115505#1725947 (10demon) 5Open>3Resolved a:3demon [18:10:26] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1725949 (10Andrew) Naming! wmf5850 (32G): labtestvirt2001 WMF5835 (16G): labtestnet2001 WMF3763 (8GB): labtestneutron2001 WMF3810 (8GB): labtestmetal20... [18:10:29] seemingly all kinds of pages (anything newly cached in the last few minutes I guess?) [18:10:32] is there a specific response header included in only the messed up blank responses I could varnish ban on, for instance? [18:10:33] how blank was blank? blank-blank or no article contents? [18:10:40] No article contents [18:10:47] is there other content? example? [18:10:47] revert [18:10:51] And likely, just pages re-rendered during the downtime, if I had to guess. [18:10:53] No article contents. Sometimes (for me) loading the "Categories" box though. 
[18:10:54] jynus: already reverted [18:10:55] empty page basically [18:10:55] it is the code deploy [18:10:57] I could ban small content-length maybe too [18:10:58] :--) [18:11:03] <_joe_> so to answer bblack: nope [18:11:04] jynus: already reverted [18:11:06] the skin loaded, but the article text was empty [18:11:12] uhm, blank pages now have content for me [18:11:12] example link? [18:11:21] so maybe the blanks weren't cached? [18:11:22] twentyafterfour: because people have been purging them [18:11:25] I'm trying to find one...I think people already purged them [18:11:26] they were cached [18:11:28] Example? For example when I load https://en.wikipedia.org/wiki/GNU_General_Public_License [18:11:32] but there is probably a long tail [18:11:36] ^^^ DONT PURGE THAT ONE [18:11:42] GNU General Public License [18:11:42] From Wikipedia, the free encyclopedia [18:11:42] Categories (++): (+) [18:11:53] is all the content I get even when bypassing browser cache [18:11:56] 6operations: Can't see any page, special:RandomPage gives databse error - https://phabricator.wikimedia.org/T115505#1725959 (10greg) Caused by https://gerrit.wikimedia.org/r/#/c/232966/, reverted https://gerrit.wikimedia.org/r/#/c/246285/ 9 minutes later. [18:12:10] andre__: yes, good example [18:12:12] that page also has [18:12:12] that link fully loads for me no blank. [18:12:13] for that error [18:12:20] so it's in parsercache too [18:12:23] jynus: wasn't the main deploy, was a config change, see that task ^ [18:12:46] we can use the parser cache hook handler to reject items cached during the outage [18:12:50] not blank here either [18:12:55] That's the kind of patch I'd test on mw1017 before fully syncing... [18:13:04] wait, so they're cached in parser cache as well? [18:13:12] it's not just varnish-cached then? 
[18:13:13] looks like it [18:13:18] it's a varnish thing so I would expect different results from different DCs [18:13:18] awesome [18:13:31] that's not how it works, bd808 [18:13:54] all varnish fe nods have the same data paravoid? [18:13:57] *nodes [18:14:07] the parser cache key changed [18:14:11] so i think they are blank [18:14:14] okay so plan: [18:14:19] all varnish fes fetch from the same eqiad bes ultimately [18:14:23] ori: we could blacklist anything where the contents are just html comments and whitespace? [18:14:24] Well, I get a blank one for the GPL article on en.wp, and I'm via esams I guess (Or not, technical people might know better) [18:14:28] parser cache hook, if article content is blank, reject parser cache entry, emit purge [18:14:32] so we can get all the URLs requested during that time period from kafka [18:14:34] but there can be timing races involved, so yes fe at two sites can end up with different versions of content [18:14:37] (03PS7) 10Merlijn van Deen: toollabs: add composer to dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/246072 (https://phabricator.wikimedia.org/T104789) [18:14:40] and issue purges for them, can't we? [18:14:44] i think what i suggested would work [18:14:49] that's the hook you hated, bblack :P [18:14:49] ori++ [18:15:15] i did this for the hieroglyph extension once [18:15:17] let me get the code [18:15:24] <_joe_> ori: seems like that would work, yes. [18:15:31] we could also do something dumb like purge all matching /wiki/ + content-length < X [18:15:48] more dumb stuff +1 [18:15:54] no, that'll match a lot of article redirects [18:16:00] * twentyafterfour doesn't think we've done enough stupid already [18:16:00] + 200 [18:16:17] do we cache redirects? 
I guess so [18:16:18] sigh [18:16:26] we can match a ban on boolean of anything in the response headers, basically [18:16:34] 200 + /wiki/ + content-length < X [18:16:44] (please don't make comments if you don't have anything constructive to say) [18:17:05] ori: are you implementing your fix? [18:17:08] <_joe_> bblack: I looked at the resp headers from the gpl page before it was purged, yes I don't see anything else that could work [18:17:30] legoktm: So which extension made it explode? :\ [18:17:31] I figured out where I fucked up. [18:17:33] LinksUpdate::incrTableUpdate FROM pagelinks [18:17:35] <_joe_> bblack: ofc that is going to generate a lot of false positives [18:17:42] 2497 +» wfLoadExtension( 'Disambiguator' ); [18:17:42] 2497 2498 » require_once( "$IP/extensions/Disambiguator/Disambiguator.php" ); [18:17:47] what's the typical content length of one of these empty-content pages? [18:17:52] I didn't remove the require_once for Disambiguator [18:18:10] i'm on a really shitty connection [18:18:11] https://dpaste.de/hMA8/raw [18:18:13] apergos: it probably varies quite a lot... language bar, article title... [18:18:13] is the basic idea [18:18:16] legoktm: We should do these 1 by 1 from now on, rather than batches :\ [18:18:19] that's the code from syntaxhighlight [18:18:23] ok scratch that then [18:18:29] but /* some condition */ needs to be written in [18:18:30] (03CR) 10Legoktm: Load some more extensions directly through wfLoadExtension() (B-D) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232966 (owner: 10Legoktm) [18:18:34] ori: Turning that into a legit patch [18:18:36] yeah :/ [18:18:38] thanks [18:18:48] 6operations: Can't see any page, special:RandomPage gives databse error - https://phabricator.wikimedia.org/T115505#1725971 (10Aklapper) Status update: Currently potential followup actions are being investigated. [18:18:49] ok, so ostriches running point then? 
[18:18:58] yeah [18:19:02] good, thanks [18:19:03] <_joe_> ok good [18:19:04] yes, i am on too shitty a connection to commit to anything [18:19:15] thanks opsen, go back to the beach [18:19:35] lol [18:19:38] anything I can help with? [18:19:39] :) [18:19:43] it's just jealousy [18:19:45] bblack: maybe i'm wrong and a ban would be better [18:19:46] * andre__ takes a walk to the Village Pumps [18:20:01] andree__: Good Idea [18:20:03] mark: yeah, we hated the idea of you all not working on our stuff, so we paged the heck out of you [18:20:07] At dewiki this is clear [18:20:12] the hook would only run as those pages are requested from the app servers, so it would require a page load from a logged in user [18:20:15] was the s1 slave lag related? [18:20:18] greg-g: incident doc for this and yesterday's too, after all of this is settled [18:20:30] greg-g: because I still don't understand how something like this could escape all of our QA [18:20:34] the lag was created by the 8 minute query I just pasted [18:20:42] that'll take care of it eventually, but if a ban on content-length < XXX is not too hard that would be faster [18:21:03] jynus: pasted where? [18:21:09] paravoid: there isn't much qa on config changes, honestly [18:21:11] ori: That's going to do a /lot/ of false positives for any value XXX [18:21:26] <_joe_> 20:17 < jynus> LinksUpdate::incrTableUpdate FROM pagelinks [18:21:27] oh so this was just a mw-config change, ok [18:21:28] paravoid: yeah, legoktm will write an incident report for this one, right legoktm ? :P [18:21:31] Coren: yes, but false positives = backend request, and presumably a cheap one for the false positives [18:21:39] greg-g, paravoid: yeah, I can do that [18:21:54] (03PS1) 10Merlijn van Deen: toollabs: install mailutils [puppet] - 10https://gerrit.wikimedia.org/r/246289 (https://phabricator.wikimedia.org/T114073) [18:22:05] ori: have an idea on length limit? [18:22:14] (consider lang variants, skin, etc?) 
[18:22:17] paravoid: yes not a deployment ...just bad timing (I was about to sync group1 when it blew up) [18:22:23] ori: Yeah, I'm just worried that X needs to be fairly high if we want to avoid false negatives and much would be uncachable for a while. [18:22:26] I have a ban expression prepped for what we said above [18:22:47] <_joe_> if anyone finds a page which presents the problem, please paste the link here [18:22:54] (and don't purge it heh) [18:23:03] <_joe_> yep, exactly :) [18:23:36] did this last from 17:58 to 18:06? [18:24:01] trying to extrapolate from https://wikitech.wikimedia.org/wiki/Server_Admin_Log but all of the deploys say "(no message)" so it's hard to tell [18:24:18] greg-g: why is legoktm on the hook for writing the postmortem?! [18:24:22] paravoid: think so [18:24:46] paravoid: 17:59 to 18:06 [18:24:48] ehm. i just got 0 content in the main page.... [18:24:56] where? [18:25:03] enwiki? from the us or eu? [18:25:03] which main page? [18:25:06] enwiki? [18:25:06] sry [18:25:08] no [18:25:08] ori: his patch, right? [18:25:30] RECOVERY - MariaDB Slave Lag: s1 on db1051 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:25:31] (03PS1) 10Chad: Purge pages with blank content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246290 [18:25:37] (03CR) 10jenkins-bot: [V: 04-1] Purge pages with blank content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246290 (owner: 10Chad) [18:25:38] ori, others: Let's not rush, but ^^^ [18:26:12] Starting with valid php would help [18:26:21] sorry, my screen session cached a key up and when connection recovered, i just happen to press enter and my last history entry was posted in IRC :) [18:26:38] ostriches: isn't the newPP report cached? [18:26:55] does anyone have the content of a blank article still up in browser, or fetched via curl, etc? for an idea on content-length? 
[18:27:06] I have one open in my browser [18:27:09] ostriches: http://fpaste.org/279225/84720014/raw/ [18:27:15] full query (NDA only): https://phabricator.wikimedia.org/P2197 [18:27:36] bblack: https://en.wikipedia.org/wiki/Throw-in [18:27:48] 6operations, 10hardware-requests: codfw/eqiad: (1) eventlogging node (per site) - https://phabricator.wikimedia.org/T90747#1726009 (10RobH) [18:27:49] 6operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1726007 (10RobH) [18:27:51] 6operations, 10hardware-requests: deploy eventlog2001 - https://phabricator.wikimedia.org/T90907#1726005 (10RobH) 5Resolved>3Open Service implementation never completed. I'll need to followup with Ori about this at a later time and find out what it is waiting on. (He may not have been the proper person t... [18:28:01] legoktm: So if page /starts/ with