[00:00:24] everything okay, Dereckson?
[00:00:45] yup, we wait https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/2588/console
[00:01:01] Change merged.
[00:02:11] so git log HEAD..origin/wmf/1.27.0-wmf.20 gives only the new commit, this is fine too.
[00:02:24] yes
[00:02:26] I sometimes find that I have to wait so long for Jenkins that I've forgotten I was deploying by the time it's done and don't notice for a few minutes
[00:02:52] also I don't think we need to wait for phpunit tests for js-only changes
[00:03:56] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80460 MB (15% inode=99%)
[00:05:03] Okay everything looks good on Tin in core/ext Git histories, we can sync-file now.
[00:05:25] yep, looks good
[00:07:41] !log dereckson@tin Synchronized php-1.27.0-wmf.20/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTargetLoader.js: Check wpSection before converting textbox contents for use in VE ([[Gerrit:282827]]) (duration: 00m 25s)
[00:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:48] Krenair: ^
[00:08:19] thanks
[00:09:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:11:07] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79719 MB (15% inode=99%)
[00:11:48] shards are moving off elastic1001 ... it will be happy whenever that finishes
[00:11:58] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/collation/IcuCollation.php: Ie9799f5ea: Cache first-letter data in APC, if available (duration: 00m 25s)
[00:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:21] Krenair: works fine?
[00:13:34] Still waiting for VE to load in debug mode
[00:13:49] am watching the list of JS files scroll by as they get downloaded
[00:14:05] Dereckson, yep, works. Thanks!
[00:14:32] This so closes the SWAT.
[00:25:04] (CR) EBernhardson: "actually this just ends up completely failing for the logstash servers, since version is a required variable. Updating to fix." [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: EBernhardson)
[00:26:33] (PS2) EBernhardson: Specify specific version of elasticsearch package [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376)
[00:27:07] (PS3) EBernhardson: Specify specific version of elasticsearch package [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376)
[00:27:27] RECOVERY - Disk space on elastic1001 is OK: DISK OK
[00:30:25] (CR) EBernhardson: "puppet compiler seems happier now: http://puppet-compiler.wmflabs.org/2399/" [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: EBernhardson)
[00:32:13] can you specify something non-specifically?
[00:32:31] :P
[00:34:54] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/page/WikiPage.php: Ie9799f5ea: Flag triggerOpportunisticLinksUpdate() behind $wgMiserMode (duration: 00m 31s)
[00:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:56] !log Deleted refreshLinksDynamic jobs for commonswiki and enwiki (T132318)
[00:37:57] T132318: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318
[00:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:51:46] (PS4) Yuvipanda: elasticsearch: Pin elasticsearch package to specific version [puppet] - https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: EBernhardson)
[01:38:28] ori: Ie9799f5ea? It's Ieeaf047e74
[02:24:51] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 11m 20s)
[02:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 12 02:33:25 UTC 2016 (duration 8m 34s)
[02:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:11:27] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0]
[03:36:34] Operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#2197918 (Jalexander) >>! In T109810#2196082, @Dzahn wrote: > @Jalexander sorry, i just assigned directly because i wasn't sure what project to add and made T132373 to request a new project for it. makes sens...
[03:44:17] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[03:48:22] (PS1) Dereckson: Import sources on hi.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417)
[04:02:47] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80758 MB (15% inode=99%)
[04:11:56] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80766 MB (15% inode=99%)
[04:22:46] RECOVERY - Disk space on elastic1001 is OK: DISK OK
[05:39:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [5000000.0]
[06:19:27] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:21:07] Operations, Analytics, DNS, Traffic: Create data.wikimedia.org - https://phabricator.wikimedia.org/T132407#2198010 (Peachey88)
[06:21:17] Operations, Analytics, DNS, Traffic: Create data.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (Peachey88) Will it be a wiki? microsite?
[06:30:27] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:22] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198015 (Joe) So, a few points: * If the MBR was wiped, it was not wiped by the installation process (the MBR stays untouc...
[06:47:06] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:48:50] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2196527 (MoritzMuehlenhoff) It's really strange that it initiated a PXE install, both wdqs* servers were rebooted w/o prob...
[06:50:59] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2198040 (elukey)
[06:51:41] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198042 (Joe) Why were we rebooting this system to start with?
[06:55:31] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198043 (Smalyshev) AFAIK something to do with kernel upgrade, @Gehel should know the details.
[06:55:39] Operations, MediaWiki-General-or-Unknown, HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2198044 (Joe) Open>Resolved
[06:56:06] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:57:11] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198046 (MoritzMuehlenhoff) The reboot was for the upgrade to Linux 4.4
[06:57:18] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:47] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:26] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198049 (Volans) @Joe FYI: I was able yesterday to get a shell in the installer and mount /dev/sda1 and there was a boot pa...
[07:06:02] (CR) Raimond Spekking: [C: -1] "I see no need for it as long as a community doesn't request such a limit." [mediawiki-config] - https://gerrit.wikimedia.org/r/280002 (owner: Jforrester)
[07:08:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[07:10:29] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2198070 (elukey)
[07:12:10] apergos: Anything on https://phabricator.wikimedia.org/T127793? Possible to add estimated time there?
[07:13:10] kart_: thanks for the ping! will add notes today.
[07:14:06] RECOVERY - puppet last run on elastic2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:27] Operations, Analytics: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2198076 (elukey) a:Ottomata>elukey
[07:14:52] Operations, Analytics-Kanban: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#798408 (elukey)
[07:16:57] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[07:36:34] (PS3) Muehlenhoff: Manage /etc/pam.d/sshd in role::bastionhost::2fa via puppet [puppet] - https://gerrit.wikimedia.org/r/282159
[07:43:57] (PS1) ArielGlenn: back off of hhvm for dumps due to xmlreader + compress.bz2 issues [puppet] - https://gerrit.wikimedia.org/r/282857
[07:45:23] (CR) ArielGlenn: [C: 2] back off of hhvm for dumps due to xmlreader + compress.bz2 issues [puppet] - https://gerrit.wikimedia.org/r/282857 (owner: ArielGlenn)
[07:48:48] (PS4) Muehlenhoff: Manage /etc/pam.d/sshd in role::bastionhost::2fa via puppet [puppet] - https://gerrit.wikimedia.org/r/282159
[07:48:57] (CR) Muehlenhoff: [C: 2 V: 2] Manage /etc/pam.d/sshd in role::bastionhost::2fa via puppet [puppet] - https://gerrit.wikimedia.org/r/282159 (owner: Muehlenhoff)
[07:50:40] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2198086 (ArielGlenn) Have backed off of hhvm use for dumps for now. The current run will continue with regular php5.
[07:52:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[07:54:06] PROBLEM - Apache HTTP on mw1111 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50396 bytes in 1.385 second response time
[07:55:47] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.054 second response time
[07:59:27] PROBLEM - DPKG on restbase1014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:59:50] (PS1) Giuseppe Lavagetto: mediawiki::hhvm: remove the warmup job [puppet] - https://gerrit.wikimedia.org/r/282859
[08:00:18] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[08:00:20] mw1111 --> hhvm main process (8150) killed by SEGV signal
[08:00:22] !log Set rpl_semi_sync_master_timeout=100 on db1038 T131753 (filling up error log)
[08:00:23] T131753: Semi-synchronous replication status in all MySQL production clusters - https://phabricator.wikimedia.org/T131753
[08:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:01:26] (CR) jenkins-bot: [V: -1] add status display for dumpadmin script [dumps] - https://gerrit.wikimedia.org/r/282860 (owner: ArielGlenn)
[08:01:51] !log restarted hhvm on mw1111 - multiple SEGV signals, hhvm-dump-debug in /tmp/hhvm.8424.bt.
[08:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:03:17] RECOVERY - DPKG on restbase1014 is OK: All packages OK
[08:03:54] (PS2) ArielGlenn: add status display for dumpadmin script [dumps] - https://gerrit.wikimedia.org/r/282860
[08:05:42] (CR) ArielGlenn: [C: 2] add status display for dumpadmin script [dumps] - https://gerrit.wikimedia.org/r/282860 (owner: ArielGlenn)
[08:05:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:14:56] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:20:07] (PS1) Muehlenhoff: Enable base::firewall [puppet] - https://gerrit.wikimedia.org/r/282865
[08:20:54] Puppet, Beta-Cluster-Infrastructure, Scap3: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2192991 (mobrovac) This is likely because `.git/DEPLOY_HEAD` does not exist for these repos on `deployment-tin`
[08:21:51] (PS1) ArielGlenn: add debdeploy and admin group configs for new snapshots [puppet] - https://gerrit.wikimedia.org/r/282866
[08:22:34] Puppet, Beta-Cluster-Infrastructure, Services, User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2198117 (mobrovac) a:mobrovac *sigh*. A simple `service X...
[08:22:44] Puppet, Beta-Cluster-Infrastructure, Services, User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2198120 (mobrovac) p:Triage>Normal
[08:31:51] (PS3) Ema: Add icinga monitoring for varnish statistics daemons [puppet] - https://gerrit.wikimedia.org/r/282488 (https://phabricator.wikimedia.org/T131760) (owner: Adedommelin)
[08:32:37] (CR) Ema: [C: 2 V: 2] "Looks good to me! Thanks Alexandre." [puppet] - https://gerrit.wikimedia.org/r/282488 (https://phabricator.wikimedia.org/T131760) (owner: Adedommelin)
[08:35:13] !log installing ntp bugfix updates (will trigger some ntpd Icinga warnings until the clocks have resynced)
[08:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:41:16] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:17] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Puppet last ran 19 hours ago
[08:41:17] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:18] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:18] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:18] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:18] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:18] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:26] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:27] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:27] PROBLEM - puppet last run on mc2003 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:27] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Puppet last ran 19 hours ago
[08:41:35] <_joe_> this is me reenabling puppet...
[08:41:47] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:48] PROBLEM - puppet last run on mc1013 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:41:48] PROBLEM - puppet last run on mc2012 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:06] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:16] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:18] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:27] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:27] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:27] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Puppet last ran 19 hours ago
[08:42:27] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:27] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: Puppet last ran 19 hours ago
[08:42:27] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:27] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:37] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:38] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:46] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:46] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:42:57] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[08:43:06] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:43:07] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:38] RECOVERY - puppet last run on mc2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:44:17] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:44:20] Operations, Analytics-Kanban: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2198145 (elukey) @Dzahn: agreed! Moved the ticket to our "Next up" column and assigned to me, I believe it was something I should have done in the beginning as "training" :) I double checked and th...
[08:44:57] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[08:48:46] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:49:27] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:50:57] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:51:37] (CR) Elukey: [C: 2] Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[08:51:49] (CR) Elukey: [V: 2] Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[08:52:07] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:52:46] (Abandoned) Elukey: fix up argument for syslog reload [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/277220 (owner: ArielGlenn)
[08:53:09] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:53:47] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[08:54:07] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:54:08] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:55:17] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:57:16] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:57:27] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:58:07] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:58:58] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:59:47] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[09:00:11] Operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198176 (Joe) so, it could be possible that the package upgrade failed to correctly install grub to the mbr?
[09:00:56] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:01:07] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[09:01:55] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:04:56] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:04:57] RECOVERY - puppet last run on mc1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:05:16] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:05:17] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:05:21] (PS1) Elukey: Package latest upstream (1.9.0-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/282876 (https://phabricator.wikimedia.org/T129344)
[09:06:22] (Abandoned) Elukey: Package latest upstream (1.9.0-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/282876 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[09:06:55] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:07:16] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:08:16] (PS1) Elukey: Package latest upstream (1.9.0-1) [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/282878 (https://phabricator.wikimedia.org/T129344)
[09:08:35] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:12:05] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:12:27] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:17:52] mutante: have you done NTP for codfw now? :)
[09:19:16] Operations: Build ircd-ratbox for jessie - https://phabricator.wikimedia.org/T132427#2198211 (MoritzMuehlenhoff)
[09:19:35] Operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2198225 (MoritzMuehlenhoff)
[09:19:37] Operations: Build ircd-ratbox for jessie - https://phabricator.wikimedia.org/T132427#2198226 (MoritzMuehlenhoff)
[09:20:35] volans: yeah, db-core-codfw in debdeploy
[09:21:06] !log varnishstatsd-default restarted on cp4010 and cp4018
[09:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:21:17] ops... pressed tab too early on the m... thanks moritzm
[09:25:16] (PS1) Giuseppe Lavagetto: switchover: stop jobrunners in eqiad [puppet] (switchover) - https://gerrit.wikimedia.org/r/282880
[09:25:18] (PS1) Giuseppe Lavagetto: switchover: make jobrunners in codfw start up [puppet] (switchover) - https://gerrit.wikimedia.org/r/282881
[09:28:30] Operations: Build ircd-ratbox for jessie - https://phabricator.wikimedia.org/T132427#2198211 (Peachey88) What custom patches do we include in the ratbox package?
[09:32:20] Operations, Analytics, Traffic, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198273 (ema)
[09:32:30] Operations, Analytics, Traffic, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198285 (ema) p:Triage>Normal
[09:36:31] (CR) Ema: [C: 2 V: 2] Package latest upstream (1.9.0-1) [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/282878 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[09:43:16] (PS1) ArielGlenn: [WIP] add dumps for flow pages for those wikis which have Flow enabled [dumps] - https://gerrit.wikimedia.org/r/282883
[09:49:57] (PS1) Muehlenhoff: Add missing build-dependencies on bison and flex [debs/ircd-ratbox] (debian) - https://gerrit.wikimedia.org/r/282886
[09:54:01] (PS1) Ema: Port varnishxcps to new VSL API [puppet] - https://gerrit.wikimedia.org/r/282887 (https://phabricator.wikimedia.org/T131353)
[09:54:48] (PS1) Filippo Giunchedi: Set synchronous swift writes for eqiad/codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/282888 (https://phabricator.wikimedia.org/T129089)
[09:55:37] !log Updated Wikidata's property suggester with data from Monday's json dump
[09:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:56:28] (CR) Muehlenhoff: [C: 2 V: 2] Add missing build-dependencies on bison and flex [debs/ircd-ratbox] (debian) - https://gerrit.wikimedia.org/r/282886 (owner: Muehlenhoff)
[10:00:34] Operations, vm-requests, Patch-For-Review: eqiad: VM request for archiva - https://phabricator.wikimedia.org/T131358#2198294 (akosiaris) Open>Resolved a:akosiaris VM is up and running. puppet/salt done. Resolving
[10:06:17] (PS1) Filippo Giunchedi: varnish: route upload backends to codfw [puppet] - https://gerrit.wikimedia.org/r/282890 (https://phabricator.wikimedia.org/T129089)
[10:11:43] (PS1) Filippo Giunchedi: varnish: swift upload codfw to 'direct' [puppet] - https://gerrit.wikimedia.org/r/282891 (https://phabricator.wikimedia.org/T129089)
[10:13:20] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:14:03] (PS2) Filippo Giunchedi: varnish: switch upload codfw to 'direct' [puppet] - https://gerrit.wikimedia.org/r/282891 (https://phabricator.wikimedia.org/T129089)
[10:14:05] (PS1) Filippo Giunchedi: varnish: switch upload eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/282892 (https://phabricator.wikimedia.org/T129089)
[10:15:09] Operations, Analytics, Traffic, Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198317 (ema) Another crash, this time on cp4017: Apr 12 07:30:32 cp4017 varnishstatsd[18631]: Traceback (most recent ca...
[10:16:26] !log restarted varnishstatsd-default.service on cp4017 (T132430)
[10:16:28] T132430: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430
[10:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:23:23] (PS1) Filippo Giunchedi: varnish: switch text 'rendering' to codfw [puppet] - https://gerrit.wikimedia.org/r/282893 (https://phabricator.wikimedia.org/T129089)
[10:23:54] (PS1) Giuseppe Lavagetto: Make use of the local, not master, parsoid cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/282894
[10:24:21] <_joe_> mobrovac: ^^
[10:27:38] hey, pgsql.eqiad.wmnet seems to be a shared host, accessible via labs. I want to check its logs. Is there a way?
[10:27:50] I checked wikitech, there was nothing
[10:29:01] Operations: Build ircd-ratbox for jessie - https://phabricator.wikimedia.org/T132427#2198324 (MoritzMuehlenhoff) @Peachey88: See here: https://github.com/wikimedia/operations-debs-ircd-ratbox/blob/master/ircd-ratbox-notalk.patch
[10:29:31] (PS1) Ema: New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813)
[10:30:20] Amir1: check its logs?
[10:30:53] yeah, one of our services actually uses it https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikilabels.pp
[10:30:54] (CR) Ema: [C: 1] misc large objects refactor [puppet] - https://gerrit.wikimedia.org/r/282716 (https://phabricator.wikimedia.org/T128813) (owner: BBlack)
[10:31:14] and it has https://phabricator.wikimedia.org/T130872
[10:31:35] (PS1) Giuseppe Lavagetto: Switch wmfMasterDatacenter to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/282897
[10:31:36] My guess is that it's a db overload issue
[10:32:24] (CR) Giuseppe Lavagetto: [C: -2] "Do not submit until during the actual switchover" [mediawiki-config] - https://gerrit.wikimedia.org/r/282897 (owner: Giuseppe Lavagetto)
[10:32:52] I downloaded 100MB of logs and checked most of it, the only possible cause is the db being overloaded
[10:33:24] Amir1: Is it in ganglia?
[10:33:51] wikilabels or the pgsql?
[10:33:59] pgsql
[10:34:06] (PS2) Ema: New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813)
[10:34:29] no idea, there is no mention of it anywhere in wikitech
[10:35:14] that's actually quite terrifying we have a shared db in prod cluster
[10:35:44] (PS1) Giuseppe Lavagetto: switchover: switch the mediawiki master datacenter [puppet] (switchover) - https://gerrit.wikimedia.org/r/282898
[10:37:11] Operations, Analytics-Kanban, Traffic, Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2198329 (elukey) ``` root@carbon:~# reprepro ls varnishkafka varnishkafka | 1.0.2-1 | precise-wikimedia | amd64, source varnishkafka | 1.0.6-1 | trusty-wikimedia |...
[10:37:34] Amir1: pgsql.eqiad.wmnet is an alias for labsdb1004.eqiad.wmnet.
[10:38:30] 10.64.37.8
[10:39:05] thanks, but since I don't have prod access I can't log into it
[10:39:23] <_joe_> Amir1: what do you really need?
[10:39:30] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[10:39:39] <_joe_> I mean what do you need to look at pgsql for?
[10:39:45] I want to check if that server was overloaded at certain times
[10:40:01] <_joe_> why?
[10:40:06] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikilabels.pp
[10:40:06] _joe_: because this tool of ours is using it
[10:40:16] <_joe_> sorry, I am trying to reconstruct what the real problem is
[10:40:17] and it has this bug: https://phabricator.wikimedia.org/T130872
[10:40:21] Wikimedia Grid > MySQL eqiad > labsdb1004.eqiad.wmnet
[10:40:34] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=MySQL+eqiad&h=labsdb1004.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS
[10:40:37] Looks pretty unbusy
[10:41:11] Operations, procurement: Cancel cross connects to TiNet/GTT at ulsfo - https://phabricator.wikimedia.org/T132432#2198330 (mark)
[10:41:14] I checked logs, it happens with any kind of request, any type,
[10:41:19] <_joe_> Amir1: I would suggest you start by profiling the application by turning up logging
[10:41:19] Operations, procurement: Cancel cross connects to TiNet/GTT at ulsfo - https://phabricator.wikimedia.org/T132432#2198343 (mark) p:Triage>High
[10:41:26] but quite random and while other workers are working fine
[10:41:29] <_joe_> something tells me it could well be nfs
[10:41:35] <_joe_> is this running on toollabs?
[10:41:39] nope [10:41:45] wikilabels project [10:41:49] <_joe_> so just labs [10:41:57] yeah [10:42:21] <_joe_> I mean there could be a thousand reasons for this to fail, before we look at the db; even then, involve ops people from labs on the bug [10:42:45] hmm [10:43:37] I want to see if it's an issue on our service when I'm sure it's not an issue in our service I contact them [10:44:33] <_joe_> Amir1: I am not sure I understand; if you suspect the db is involved, it's most surely something ops should look at [10:44:52] <_joe_> that is of course if you have any evidence (or even a hint) that that's the actual issue [10:45:23] <_joe_> (ops should just because it's a service we provide to you) [10:45:54] (03CR) 10Luke081515: [C: 031] Import sources on hi.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282843 (https://phabricator.wikimedia.org/T132417) (owner: 10Dereckson) [10:46:02] <_joe_> btw, how long do those spikes last? [10:46:31] _joe_: my hint is that it happens that not related to any parameters of us [10:46:47] I mean in every type and request that db is involved [10:46:59] I had never such issue when db wasn't involved [10:47:13] <_joe_> so you can reproduce the problem? [10:47:30] <_joe_> does your logging/profiling tell you what's the function in which you spend most of the time? [10:47:30] it happens in logs [10:47:41] some requests between tons of requests [10:47:42] <_joe_> those are all informations we could use to debug the problem [10:47:51] (e.g. 37 request in 25K) [10:48:05] <_joe_> uh so let me understand [10:48:27] _joe_: I have another hint too. 
It almost stopped after April 6 [10:48:38] <_joe_> you have 0.2% of requests that are very slow [10:48:38] (down to 1 or two in 25K) [10:48:55] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=labsdb1004.eqiad.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1460457642&g=load_report&z=large&c=MySQL%20eqiad [10:49:30] that shows it's related to load of the server [10:49:35] <_joe_> this is a correlation, yes [10:50:14] for an exact number, I can dig deeper [10:50:19] <_joe_> well, 1 request out of 25 K is 0.004% of requests [10:50:28] <_joe_> if that is the size of the issue [10:50:46] and that 1 request was a huge request [10:50:52] <_joe_> I would consider it pretty minor and hard to pin, but it's certainly related to the db and you should involve labs ops [10:50:54] totally understandable [10:50:57] <_joe_> uhm [10:51:12] <_joe_> so you're telling me it was the db and it's now solved :) [10:51:41] but it happened much much more before [10:51:55] <_joe_> the db overload can totally explain it [10:52:00] <_joe_> given the timing correlates [10:52:11] my fear is that it happens again once the load on the server goes up [10:52:28] <_joe_> Amir1: well, if the db is overloaded, apps suffer [10:52:31] <_joe_> it's expected [10:53:08] yeah :D [10:53:15] <_joe_> and still, a 0.15% of slow requests is a degradation for sure, report that to the labs people [10:53:43] okay [10:53:46] thanks for the tip [10:54:05] Reedy: thanks for the chart. it was extremely helpful [10:54:06] <_joe_> add the graphs from ganglia and the data you just gave me to the ticket [10:54:17] sure [10:54:41] _joe_: one thing, unrelated. Do you know what is the status of ops moving to prod? [10:54:56] <_joe_> I guess you mean Ores? [10:54:59] I have a scap3 patch there, waiting to be merged :D [10:55:04] haha [10:55:07] yes, sorry [10:55:21] <_joe_> Amir1: is akosiaris aware of it?
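The slow-request percentages quoted in the exchange above (roughly 0.15% before April 6, 0.004% after) follow directly from the counts given in the chat: 37 slow requests in about 25K, later 1 in about 25K. A minimal sketch of that arithmetic, assuming those two counts are representative samples; the function name `slow_request_rate` is hypothetical and not part of any tool mentioned in the log:

```python
def slow_request_rate(slow, total):
    """Return the share of slow requests as a percentage of all requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * slow / total


# Counts as quoted in the conversation (approximate, "25K" taken as 25,000).
before = slow_request_rate(37, 25_000)  # before April 6
after = slow_request_rate(1, 25_000)    # after April 6

print(f"before: {before:.3f}%  after: {after:.3f}%")
```

Under these assumptions the earlier figure rounds to the 0.15% mentioned at 10:53:15, and the later one matches the 0.004% mentioned at 10:50:19.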
[10:55:44] (03CR) 10BBlack: [C: 031] New misc VTC test: 09-chunked-response-add-cl.vtc [puppet] - 10https://gerrit.wikimedia.org/r/282895 (https://phabricator.wikimedia.org/T128813) (owner: 10Ema) [10:55:50] I guess so, I pinged him yesterday but I don't want to over ping him [10:55:59] (he hasn't answered) [10:56:12] <_joe_> so you're pinging me instead so that I can ping him? :P [10:56:40] I thought you might know something [10:59:19] thanks anyway _joe_ :) [11:00:25] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198388 (10BBlack) It looks like various crashes like these have been happening for a while, and puppet runs are what normall... [11:03:34] !log repool restbase2005 / depool restbase2006 [11:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:50] !log start raid expansion on restbase2006 T127951 [11:10:51] T127951: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951 [11:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:31] (03PS5) 10BBlack: delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [11:18:57] (03PS2) 10BBlack: Remove puppet/recursor0/recursor1.esams CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/281117 (owner: 10Faidon Liambotis) [11:19:10] !log running CentralAuth/checkLocalUsers.php --delete on all wikis for T119736 [11:19:11] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [11:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:19:33] (03CR) 10BBlack: [C: 032] Remove puppet/recursor0/recursor1.esams CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/281117 (owner: 10Faidon Liambotis) [11:21:53] (03PS6) 
10BBlack: delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [11:22:37] (03CR) 10BBlack: [C: 031] delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [11:30:28] (03PS1) 10Muehlenhoff: Add a Conflicts on whois (filename clash with /usr/bin/mkpasswd) [debs/ircd-ratbox] (debian) - 10https://gerrit.wikimedia.org/r/282902 (https://phabricator.wikimedia.org/T132427) [11:31:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add a Conflicts on whois (filename clash with /usr/bin/mkpasswd) [debs/ircd-ratbox] (debian) - 10https://gerrit.wikimedia.org/r/282902 (https://phabricator.wikimedia.org/T132427) (owner: 10Muehlenhoff) [11:33:52] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2198426 (10BBlack) [11:34:17] !log Jenkins upgrading "Script Security Plugin" from 1.17 to 1.18.1 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-04-11 [11:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:40] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [11:35:01] <_joe_> bblack: might have something to do with your changes? ^^ [11:35:13] <_joe_> checking [11:35:37] not that I'm aware of, but looking [11:35:38] <_joe_> nope [11:35:44] <_joe_> it's a puppet master fail [11:36:00] ok [11:37:24] !log Restarting Jenkins [11:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:01] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) 
from DNS - https://phabricator.wikimedia.org/T130414#2198434 (10BBlack) [11:38:05] 06Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2198433 (10BBlack) [11:43:14] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2198443 (10BBlack) [11:43:16] 06Operations, 10Traffic, 07HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#2198441 (10BBlack) 05Open>03declined With the impending removal of *.donate, we'll actually finally be able to HSTS wikimedia.org itself at the DNS level. [11:43:47] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#1411365 (10BBlack) [11:43:50] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2198444 (10BBlack) [11:44:16] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS: Check all wikis for inclusions of http resources on https - https://phabricator.wikimedia.org/T36670#2198447 (10BBlack) [11:44:18] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#1142785 (10BBlack) [11:47:13] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2198449 (10BBlack) Note: this is already the case with two exceptions: 1. We're not sending HSTS (or even forcing HTTPS) on all of cache_misc's service hostnames yet 2. We're not sending HSTS for the *... [11:47:29] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2198451 (10BBlack) [11:47:31] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2198452 (10BBlack) [11:49:04] Amir1: I am around, what is that you would like to know ? 
[11:49:11] 06Operations, 07Icinga: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2198454 (10akosiaris) @dzahn. We 've already got replacement boxes. See T121583 and T121582 respectively for codfw and eqiad. But don't just reuse einsteinium to replace neon, the idea is to manage to kill neo... [11:49:32] PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:50:40] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2198458 (10BBlack) [11:52:58] (03Abandoned) 10Muehlenhoff: Enable base::firewall on terbium [puppet] - 10https://gerrit.wikimedia.org/r/280627 (owner: 10Muehlenhoff) [11:54:17] akosiaris: hey, I don't want to bother you, I just really want to know what is the status of ores moving to prod, is there naything I can do to fasten the process or is there eta of deployment [11:56:52] Amir1: the status is I am creating service::python as a sibling of service::node in order to get all the goodies we get from that (automatic monitoring, automatic firewalling, scap3 and all that) then amend the beta classes to use it, make sure it works and then apply it to production [11:57:15] I am almost done with that, but I have no real eta right now [11:57:40] oh, amazing [11:57:44] thank you alex [11:57:49] and sorry to bother you [11:58:03] can you keep me posted if anything even minor happens? [11:58:55] er... minor ? [11:59:02] I lost you there... what do you mean ? [11:59:26] any progress [11:59:46] that might end up being way more noise that it's worth for both of us [11:59:58] may I suggest you let me use good common sense ? 
[11:59:59] okay [12:00:09] cool [12:00:10] thanks [12:00:25] thanks [12:03:01] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:08:54] !log uploaded ircd-ratbox 2.2.9-3 for jessie-wikimedia to carbon [12:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:48] 06Operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2198502 (10MoritzMuehlenhoff) [12:13:30] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 1 failures [12:13:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [12:13:47] (03CR) 10Mobrovac: [C: 031] Make use of the local, not master, parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282894 (owner: 10Giuseppe Lavagetto) [12:15:51] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:16:25] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#1351184 (10Lokal_Profil) I like to join #Project-Creators to be able to setup/coordinate/clean-up projects for Wikimedia Sverige. @Ainali... 
[12:16:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [12:18:02] (03PS1) 10BBlack: Common VCL: enable https redirect/HSTS code for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/282907 [12:18:04] (03PS1) 10BBlack: HTTPS: redirect + HSTS-preload for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/282908 [12:18:06] (03PS1) 10BBlack: Misc VCL: re-use common 751 code for HTTPS redirects [puppet] - 10https://gerrit.wikimedia.org/r/282909 [12:20:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:21:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:22:26] (03PS1) 10BBlack: switchover: switch api/appservers.svc varnish routing from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/282910 [12:31:31] RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.002 second response time on port 9042 [12:37:02] 06Operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#2198557 (10MoritzMuehlenhoff) [12:37:04] 06Operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#2198555 (10MoritzMuehlenhoff) 05stalled>03Resolved Basic setup is done, all further auth server/2FA developments takes place on other tickets. [12:37:20] 06Operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1814545 (10MoritzMuehlenhoff) [12:37:22] 06Operations, 10ops-codfw, 06DC-Ops: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#2198558 (10MoritzMuehlenhoff) 05stalled>03Resolved Basic setup is done, all further auth server/2FA developments takes place on other tickets. 
[12:38:03] 06Operations, 10Traffic, 07HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#2198576 (10Chmarkine) >>! In T111967#2198441, @BBlack wrote: > With the impending removal of *.donate, we'll actually finally be able to HSTS wikimedia.org itself at the DNS le... [12:39:50] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:41:01] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198613 (10elukey) The errors seems to be related to two tags: ``` 110 elif tag == 'BackendXID': 111 # Associate... [12:43:25] (03PS2) 10Muehlenhoff: Enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/282865 [12:47:47] 06Operations, 10Traffic, 07HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#2198642 (10BBlack) Right: "at the DNS level" means it's fixed in terms of not having bad DNS names with no matching certs. We still have services that don't support HTTPS, or... 
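For readers unfamiliar with the HSTS tickets referenced above, the response header in question is `Strict-Transport-Security`. The sketch below only shows how such a header value is composed from its directives (`max-age` and `includeSubDomains` from RFC 6797, plus the `preload` token used by browser preload lists). The helper name `hsts_header` and the one-year `max-age` are illustrative assumptions; the log does not state the values Wikimedia actually sends.

```python
def hsts_header(max_age_seconds, include_subdomains=True, preload=False):
    """Compose a Strict-Transport-Security header value from its directives."""
    if max_age_seconds < 0:
        raise ValueError("max-age must be non-negative")
    parts = [f"max-age={max_age_seconds}"]
    if include_subdomains:
        parts.append("includeSubDomains")
    if preload:
        # 'preload' is not defined in RFC 6797; it signals consent to
        # inclusion in browser preload lists.
        parts.append("preload")
    return "; ".join(parts)


# Hypothetical example: one year, covering subdomains, preload requested.
ONE_YEAR = 365 * 24 * 60 * 60  # assumed value, not taken from the log
print("Strict-Transport-Security:", hsts_header(ONE_YEAR, preload=True))
```

This also illustrates why the `*.donate` hostnames mattered in the discussion: once `includeSubDomains` applies to a domain, every hostname under it has to serve valid HTTPS.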
[12:47:56] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/282865 (owner: 10Muehlenhoff) [12:54:41] (03PS1) 10BBlack: cache_misc: remove download.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282912 [12:55:01] (03PS1) 10BBlack: cache_misc: remove rt.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282914 [12:55:13] (03PS1) 10BBlack: cache_misc: remove gerrit.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282915 [12:55:40] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198669 (10elukey) ``` /usr/local/lib/python2.7/dist-packages/varnishlog.py 100 def _vsl_handler(priv, tag_id, fd, lengt... [13:07:38] (03CR) 10Mobrovac: [C: 031] Normalize REST API Accept headers [puppet] - 10https://gerrit.wikimedia.org/r/281042 (https://phabricator.wikimedia.org/T128040) (owner: 10GWicke) [13:16:03] (03PS1) 10BBlack: cache_misc: invert TLS redir host check [puppet] - 10https://gerrit.wikimedia.org/r/282919 [13:16:05] (03PS1) 10BBlack: cache_misc: TLS-redirect existing redirects [puppet] - 10https://gerrit.wikimedia.org/r/282920 [13:21:36] Could anybody run mwscript eraseArchivedFile.php --wiki commonswiki --filename 'Sajid-kpanda.webm' --filekey '*' for T132419? This file is copyright violation so we should remove it ASAP. Thanks [13:21:37] T132419: Files are still in the cache after deletion - https://phabricator.wikimedia.org/T132419 [13:22:31] ugh [13:23:00] let me see who did that the last time this happened [13:23:06] hmm, bblack, you around? ^ [13:23:30] Urbanecm: hmm, why do you think eraseArchivedFile.php will help? [13:23:44] https://wikitech.wikimedia.org/wiki/Media_storage [13:24:01] previously these were caused by cache issues, IIRC [13:24:17] IIRC is what? 
[13:24:30] (if i remember correctly) [13:24:59] this file is deleted on commonswiki, no wiki uses it, so it's the cache. Only way how to get it is through upload.wikimedia.org [13:26:12] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2198759 (10Gehel) a:03Gehel [13:26:43] Urbanecm_: eraseArchivedFile.php is for highly sensitive/illegal files that need to be fully wiped from the server [13:27:31] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2198788 (10matmarex) [13:28:28] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2198793 (10matmarex) I made the task public, with this many duplicates filed there is really no reason not to.... [13:29:11] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2198796 (10ArielGlenn) {F3865570} This seems to be the best I can do for a stack trace; what am I missing? [13:30:13] (03PS4) 10BBlack: misc large objects refactor [puppet] - 10https://gerrit.wikimedia.org/r/282716 (https://phabricator.wikimedia.org/T128813) [13:30:31] (03CR) 10BBlack: [C: 032 V: 032] misc large objects refactor [puppet] - 10https://gerrit.wikimedia.org/r/282716 (https://phabricator.wikimedia.org/T128813) (owner: 10BBlack) [13:31:39] Urbanecm: i poked a few people and we'll get it nuked in a minute, thanks for the report [13:34:16] thanks. At 14:00 I should be here from PC and be fully available . [13:34:51] (UTC timezone) [13:36:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Blocked on adding the relevant ferm rules in the role." 
[puppet] - 10https://gerrit.wikimedia.org/r/282659 (owner: 10Muehlenhoff) [13:36:54] Urbanecm__ / MatmaRex : it's banned on the upload cluster [13:37:43] bblack: thanks, looks gone :) [13:38:17] (03PS2) 10BBlack: Common VCL: enable https redirect/HSTS code for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/282907 [13:38:46] bblack: hmm, can you document what you did on https://phabricator.wikimedia.org/T109331 , so that we can have anyone do it the next time? [13:38:58] (or is it documented already somewhere?) [13:39:11] it's already documented [13:39:21] So it's done? Thanks for fast processing. [13:39:28] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_.28bans.29 [13:39:44] (03PS1) 10ArielGlenn: move outdated dumps deploy scripts to obsolete dir, update tox.ini [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/282924 [13:39:52] Things should be documented on wikitech, not phab - easier to find [13:40:52] (03PS2) 10ArielGlenn: move outdated dumps deploy scripts to obsolete dir, update tox.ini [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/282924 [13:41:16] p858snake|L_: they are [13:42:24] (03CR) 10ArielGlenn: [C: 032] move outdated dumps deploy scripts to obsolete dir, update tox.ini [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/282924 (owner: 10ArielGlenn) [13:42:31] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis: fix ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/282925 [13:42:36] <_joe_> moritzm: ^^ [13:42:47] having a look [13:43:00] (03Abandoned) 10ArielGlenn: hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280100 (owner: 10ArielGlenn) [13:44:13] 06Operations, 10Analytics, 10DNS, 10Traffic: Create data.wikimedia.org - https://phabricator.wikimedia.org/T132407#2198830 (10MZMcBride) >>! In T132407#2198012, @Peachey88 wrote: > Will it be a wiki? microsite? redirect to a page on meta? 
It sounds like this is more of a request for a server to host/run t... [13:44:41] (03Abandoned) 10ArielGlenn: prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280101 (owner: 10ArielGlenn) [13:44:44] (03CR) 10Jgreen: [C: 031] update the DNS record for benefactors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [13:47:30] (03PS2) 10Chad: Gerrit: replicate git repositories to new home [puppet] - 10https://gerrit.wikimedia.org/r/282761 [13:48:21] 06Operations, 10DNS, 10Fundraising-Backlog, 10Traffic, and 2 others: Updating DNS records for Major Gifts subdomain (benefactors.wikimedia.org) - https://phabricator.wikimedia.org/T130937#2151449 (10BBlack) Before I merge the patch above: is all of the outbound benefactors email stopped for the duration of... [13:49:10] (03CR) 10Muehlenhoff: [C: 031] "Looks good, also doublechecked with PCC" [puppet] - 10https://gerrit.wikimedia.org/r/282925 (owner: 10Giuseppe Lavagetto) [13:51:03] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis: fix ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/282925 [13:51:36] <_joe_> moritzm: I fixed a copypasta error, I'm going to merge this change since it's a noop [13:51:49] sounds good [13:51:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::jobqueue_redis: fix ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/282925 (owner: 10Giuseppe Lavagetto) [13:53:56] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2198859 (10ema) >>! In T132430#2198388, @BBlack wrote: > 1. Fix the crashes Yep. > 2. systemd should restart it unless it's... 
[13:54:38] (03CR) 10Giuseppe Lavagetto: [C: 031] "This is now unblocked" [puppet] - 10https://gerrit.wikimedia.org/r/282659 (owner: 10Muehlenhoff) [13:57:19] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Empty result on a tree query - https://phabricator.wikimedia.org/T127014#2198865 (10BBlack) [13:57:25] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2198862 (10BBlack) 05Open>03Resolved a:03BBlack This should be resolved now, modulo perhaps some cache entries that need to fall out in the near future... [13:58:06] (03PS1) 10Giuseppe Lavagetto: role::memcached: fix reference to replication [puppet] - 10https://gerrit.wikimedia.org/r/282930 [14:04:00] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [14:04:21] (03PS3) 10BBlack: Common VCL: enable https redirect/HSTS code for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/282907 [14:05:12] (03CR) 10BBlack: [C: 032 V: 032] Common VCL: enable https redirect/HSTS code for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/282907 (owner: 10BBlack) [14:05:21] that's me ^ graphite1001 [14:05:38] (03PS2) 10Giuseppe Lavagetto: role::memcached: fix reference to replication [puppet] - 10https://gerrit.wikimedia.org/r/282930 [14:06:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::memcached: fix reference to replication [puppet] - 10https://gerrit.wikimedia.org/r/282930 (owner: 10Giuseppe Lavagetto) [14:06:19] (03PS2) 10BBlack: HTTPS: redirect + HSTS-preload for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/282908 [14:07:50] (03CR) 10BBlack: [C: 032] HTTPS: redirect + HSTS-preload for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/282908 (owner: 10BBlack) [14:08:21] 06Operations, 10DNS, 10Fundraising-Backlog, 10Traffic, and 2 others: Updating DNS records for Major Gifts subdomain (benefactors.wikimedia.org) - 
https://phabricator.wikimedia.org/T130937#2198881 (10CCogdill_WMF) Yes @BBlack, you're good to go. We aren't pushing any events right now so the system is dormant. [14:08:23] ori: there's uncommitted changes on graphite1001:/srv/org/wikimedia/performance btw, makes puppet fail [14:08:25] (03PS2) 10BBlack: Misc VCL: re-use common 751 code for HTTPS redirects [puppet] - 10https://gerrit.wikimedia.org/r/282909 [14:09:17] (03PS4) 10BBlack: update the DNS record for benefactors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [14:09:48] (03CR) 10BBlack: [C: 032] Misc VCL: re-use common 751 code for HTTPS redirects [puppet] - 10https://gerrit.wikimedia.org/r/282909 (owner: 10BBlack) [14:10:15] (03CR) 10BBlack: [C: 032] update the DNS record for benefactors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [14:11:17] 06Operations, 10DNS, 10Fundraising-Backlog, 10Traffic, and 2 others: Updating DNS records for Major Gifts subdomain (benefactors.wikimedia.org) - https://phabricator.wikimedia.org/T130937#2198884 (10BBlack) 05Open>03Resolved [14:12:33] 06Operations, 10Analytics, 10DNS, 10Traffic: Create data.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10Milimetric) There's some debate about this. We haven't used data.wikimedia.org in the past because of the possible confusion with wikidata. The wmflabs domain seems less "producti... 
[14:12:49] (03PS2) 10BBlack: cache_misc: invert TLS redir host check [puppet] - 10https://gerrit.wikimedia.org/r/282919 [14:13:52] 06Operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#2198910 (10Dzahn) [14:13:54] 06Operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#2198911 (10Dzahn) [14:13:56] 06Operations, 07Icinga: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2198909 (10Dzahn) [14:14:02] (03CR) 10BBlack: [C: 032] cache_misc: invert TLS redir host check [puppet] - 10https://gerrit.wikimedia.org/r/282919 (owner: 10BBlack) [14:14:04] 06Operations, 07Icinga: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#1972468 (10Dzahn) 05Open>03stalled [14:14:11] (03PS2) 10BBlack: cache_misc: TLS-redirect existing redirects [puppet] - 10https://gerrit.wikimedia.org/r/282920 [14:14:23] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: TLS-redirect existing redirects [puppet] - 10https://gerrit.wikimedia.org/r/282920 (owner: 10BBlack) [14:15:30] (03CR) 10Ema: [C: 031] cache_misc: remove download.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282912 (owner: 10BBlack) [14:16:41] ACKNOWLEDGEMENT - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi uncommitted changes in /srv/org/wikimedia/performance [14:18:20] (03PS1) 10Elukey: Add stat1004 configuration to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) [14:18:32] (03PS2) 10Muehlenhoff: Enable base::firewall on rdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/282659 [14:19:47] (03PS1) 10BBlack: Common VCL: enable 751 handling for varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/282938 [14:20:03] (03CR) 10BBlack: [C: 032 V: 032] Common VCL: enable 751 handling for varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/282938 (owner: 10BBlack) [14:21:17] (03CR) 
10Elukey: "@Ottomata: Hopefully this should be ok, but I still don't get the magic of the admin::groups (didn't find them in the roles)" [puppet] - 10https://gerrit.wikimedia.org/r/282936 (https://phabricator.wikimedia.org/T131877) (owner: 10Elukey) [14:22:41] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on rdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/282659 (owner: 10Muehlenhoff) [14:23:08] (03PS3) 10Muehlenhoff: Enable base::firewall on rdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/282659 [14:23:15] (03CR) 10Muehlenhoff: [V: 032] Enable base::firewall on rdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/282659 (owner: 10Muehlenhoff) [14:25:35] (03CR) 10Ema: [C: 031] cache_misc: remove rt.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282914 (owner: 10BBlack) [14:25:47] (03PS2) 10BBlack: cache_misc: remove download.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282912 [14:26:02] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: remove download.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282912 (owner: 10BBlack) [14:26:16] (03PS2) 10BBlack: cache_misc: remove rt.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282914 [14:26:23] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: remove rt.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282914 (owner: 10BBlack) [14:27:16] (03CR) 10Ema: [C: 031] "LGTM, we should remove the conditional for gerrit.w.o from templates/varnish/misc-common.inc.vcl.erb as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/282915 (owner: 10BBlack) [14:27:56] 07Puppet, 10Beta-Cluster-Infrastructure, 06Services, 15User-mobrovac: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2198924 (10Krenair) Thank you @mobrovac [14:29:02] (03CR) 10Alex Monk: "should the admin config be moved somewhere more generic so things like this don't get missed?" [puppet] - 10https://gerrit.wikimedia.org/r/282866 (owner: 10ArielGlenn) [14:30:06] (03PS2) 10BBlack: cache_misc: remove gerrit.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282915 [14:31:21] 06Operations: enable https for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T132450#2198925 (10Dzahn) [14:33:01] 06Operations, 10Traffic, 07HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#1621104 (10Dzahn) >>! In T111967#2198576, @Chmarkine wrote: > * https://status.wikimedia.org/ -- cert mismatch T34796 but stalled > * https://mirrors.wikimedia.org/ -- canno... [14:33:37] 06Operations, 10Traffic, 07HTTPS: enable https for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T132450#2198960 (10Reedy) [14:33:51] (03PS1) 10Muehlenhoff: Don't use package-> latest for apt-transport-https [puppet] - 10https://gerrit.wikimedia.org/r/282941 [14:35:58] (03PS3) 10BBlack: cache_misc: remove gerrit.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282915 [14:36:05] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: remove gerrit.wm.o (not misc in DNS) [puppet] - 10https://gerrit.wikimedia.org/r/282915 (owner: 10BBlack) [14:38:23] (03PS1) 10BBlack: Disable ulsfo T128424 [dns] - 10https://gerrit.wikimedia.org/r/282943 [14:40:47] 06Operations, 06Analytics-Kanban: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2198968 (10Dzahn) >>! 
In T76348#2198145, @elukey wrote: > the only thing that worries me a bit are all the /home directories with stuff in them, that probably will need to be backupped somewhere befor... [14:44:17] (03PS1) 10BBlack: add DYNA for wmfusercontent.org->misc [dns] - 10https://gerrit.wikimedia.org/r/282945 [14:44:35] working from "standing in line at the DMV". laptops rule [14:44:38] (03PS1) 10Gehel: remove wdqs1002 from varnish during reinstall / fix [puppet] - 10https://gerrit.wikimedia.org/r/282946 (https://phabricator.wikimedia.org/T132387) [14:44:49] but cant get to my other IRC nick in case you pinged me [14:45:26] mutante_: any chance we can coordinate -> merge on https://gerrit.wikimedia.org/r/#/c/278353/ today (the *.donate DNS delete) [14:45:53] (03CR) 10BBlack: [C: 032] add DYNA for wmfusercontent.org->misc [dns] - 10https://gerrit.wikimedia.org/r/282945 (owner: 10BBlack) [14:48:18] bblack: yes, ..well.. everybody says +1 basically, except we were nervous to do it right before vacation / on a Friday and she asked " Then we should have 48 hours planned without sending any email." [14:48:40] ok [14:48:52] i'll update the ticket and ask her [14:49:01] it sounded on the ticket like they could do it this week if notified, but I donno if they need a certain window of advance notice first [14:49:37] 06Operations, 06Analytics-Kanban: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2198981 (10elukey) @Dzahn another solution (more a hack) is to open a port / disable ferm for a bit (/me hides from @MoritzMuehlenhoff) and use netcat/openssl to copy files between hosts. 
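Note on the netcat idea in the T76348 comment above: the following is only a rough sketch of what a netcat-style copy does, namely streaming raw bytes over one TCP connection until the sender closes it. It is not the procedure used for stat1001. It runs on loopback only, sends data unencrypted and unauthenticated (the openssl half of the suggestion is not shown), and every function name here is hypothetical.

```python
import socket
import threading

CHUNK = 64 * 1024  # read size per recv() call


def receive(server, sink):
    """Accept one connection and append everything it sends to `sink`."""
    conn, _addr = server.accept()
    with conn:
        while True:
            chunk = conn.recv(CHUNK)
            if not chunk:  # empty read: the sender closed the connection
                break
            sink.extend(chunk)


def send(host, port, payload):
    """Connect to the receiver and push the whole payload."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload)


def copy_over_tcp(payload):
    """Round-trip `payload` through a loopback TCP connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    received = bytearray()
    worker = threading.Thread(target=receive, args=(server, received))
    worker.start()
    send("127.0.0.1", port, payload)
    worker.join()
    server.close()
    return bytes(received)


if __name__ == "__main__":
    data = b"home directory contents " * 10000
    assert copy_over_tcp(data) == data
    print("copied", len(data), "bytes")
```

Between real hosts the listening port would have to be opened in the firewall for the duration of the copy, which is why the comment calls this a hack.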
[14:50:20] bblack, yes i was looking for this : "Tuesday or Friday would be best for our send schedule, but I can be flexible with a little heads up :)" [14:50:27] so Tuesday, heh [14:51:00] i'm just in line at the DMV right now and it starts moving finally [14:51:15] (03PS1) 10Ema: Update misc VTC test cases (all HTTPS, no gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/282947 [14:51:38] 06Operations, 10Traffic: HSTS preload for wmfusercontent.org - https://phabricator.wikimedia.org/T132452#2198983 (10BBlack) [14:51:51] (03PS2) 10Ema: Update misc VTC test cases (all HTTPS, no gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/282947 [14:52:00] (03PS1) 10Giuseppe Lavagetto: redis::monitoring::instance: vary description based on instance name [puppet] - 10https://gerrit.wikimedia.org/r/282949 [14:52:02] 06Operations, 10Traffic: HSTS preload for wmfusercontent.org - https://phabricator.wikimedia.org/T132452#2198997 (10BBlack) [14:52:03] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis: add monitoring of the redis instances [puppet] - 10https://gerrit.wikimedia.org/r/282950 [14:52:05] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2198996 (10BBlack) [14:52:39] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2198998 (10Dzahn) Hi @CCogdill_WMF i'm back and you said Tuesday or Friday are good days for your s... [14:53:04] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2198999 (10BBlack) I should've added: 3. We're also not convered for redirect/HSTS on all the non-cache_misc direct services [14:55:02] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) 
from DNS - https://phabricator.wikimedia.org/T130414#2199003 (10CCogdill_WMF) Sure @Dzahn, I can make today work. [14:58:06] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2199012 (10ema) varnishstatsd has now crashed on cp4009, cp4017 and cp4018. For some reason it only seems to be crashing in u... [14:59:00] (03PS2) 10Giuseppe Lavagetto: redis::monitoring::instance: vary description based on instance name [puppet] - 10https://gerrit.wikimedia.org/r/282949 [14:59:56] * James_F waves in anticipation. [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160412T1500). [15:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] Tah-dah. [15:00:35] <_joe_> oh jenkins, you're slower than my first moped [15:00:42] James_F: :) I can SWAT. [15:00:48] <_joe_> and that, I needed to push it uphill [15:00:50] * James_F grins. [15:00:58] Thank you thcipriani, that'd be great. [15:01:03] <_joe_> thcipriani: is there space for a small additional patch for swatting? [15:01:08] <_joe_> I forgot to add it :( [15:01:12] _joe_: sure, np. [15:01:26] <_joe_> I'll add it now [15:01:55] (03PS5) 10Thcipriani: Enable VisualEditor Single Edit Tab on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) (owner: 10Jforrester) [15:02:09] Here goes nothing. 
[15:02:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) (owner: 10Jforrester) [15:02:14] (03CR) 10Gehel: "puppet compiler output: https://puppet-compiler.wmflabs.org/2409/" [puppet] - 10https://gerrit.wikimedia.org/r/282946 (https://phabricator.wikimedia.org/T132387) (owner: 10Gehel) [15:02:53] (03Merged) 10jenkins-bot: Enable VisualEditor Single Edit Tab on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) (owner: 10Jforrester) [15:04:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor Single Edit Tab on the English Wikipedia [[gerrit:274131]] (duration: 00m 55s) [15:04:51] ^ James_F check please [15:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:12] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2199048 (10csteipp) a:05csteipp>03None [15:05:18] * James_F waits patiently for the cache to rotate. [15:06:36] thcipriani: LGTM. [15:06:46] James_F: neat. Thanks! [15:06:57] (03PS2) 10Thcipriani: Make use of the local, not master, parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282894 (owner: 10Giuseppe Lavagetto) [15:07:07] thcipriani: Thank you! [15:07:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282894 (owner: 10Giuseppe Lavagetto) [15:07:38] (03Merged) 10jenkins-bot: Make use of the local, not master, parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282894 (owner: 10Giuseppe Lavagetto) [15:08:46] csteipp: ping me when you're awake? I'm seeing a bug in the totp keystone plugin but not sure how to fix it. 
[15:09:17] (03CR) 10Giuseppe Lavagetto: "Hi, this is scheduled for PuppetSWAT today, but there is:" [puppet] - 10https://gerrit.wikimedia.org/r/282186 (owner: 1020after4) [15:10:14] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Make use of the local, not master, parsoid cluster [[gerrit:282894]] (duration: 00m 28s) [15:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:21] ^ _joe_ check please [15:10:26] <_joe_> thcipriani: ok [15:11:09] <_joe_> thcipriani: seems ok according to the logs [15:11:27] _joe_: okie doke. Thanks! [15:11:35] <_joe_> thank you [15:12:17] (03CR) 10Gehel: [C: 031] elasticsearch: Pin elasticsearch package to specific version [puppet] - 10https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: 10EBernhardson) [15:16:26] (03CR) 10Ema: [C: 032 V: 032] Update misc VTC test cases (all HTTPS, no gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/282947 (owner: 10Ema) [15:20:19] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2199095 (10Varnent) 05Open>03Resolved Thank you @faidon for your quick work on this!!! I appreciate the notes about the ineffectiveness of this particular setup, an... [15:24:18] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199111 (10thcipriani) aqs deployment failures === It looks like the scap puppet provider is attempting to deploy `analytics/aqs/deploy` from `deployment-t... [15:25:10] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1402411 (10BBlack) I've done a bunch of cleanup on misc-web today, including: 1. removing the dead service entries (download, gerrit, rt) 2. inverting the existing TLS-red... 
[15:25:46] 06Operations, 10Beta-Cluster-Infrastructure, 06Services, 07Tracking: Move Node.JS services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2199121 (10mobrovac) [15:27:32] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2199127 (10BBlack) I should add: once we can kill the last entry in that last of HTTPS-exceptions, we can drop that whole block and simply set cache_misc's https_redirects... [15:29:01] (03CR) 10BBlack: [C: 031] remove wdqs1002 from varnish during reinstall / fix [puppet] - 10https://gerrit.wikimedia.org/r/282946 (https://phabricator.wikimedia.org/T132387) (owner: 10Gehel) [15:32:06] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [15:32:13] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2199146 (10BBlack) [15:32:16] 06Operations, 10Analytics, 10DNS, 10Traffic: Create data.wikimedia.org - https://phabricator.wikimedia.org/T132407#2199157 (10Nuria) >It sounds like this is more of a request for a server to host/run tools than a request for just a DNS A record. I'm curious why Wikimedia Labs is insufficient. Being a prod... [15:32:52] godog: they're identical to what's in git, except for whitespace etc. [15:32:58] i applied them locally because puppet was disabled [15:33:10] you can delete the files and let puppet recreate them [15:33:16] or git reset --hard [15:33:50] ori: ok thanks! 
[15:33:59] godog: thank you [15:34:00] 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458#2199162 (10BBlack) [15:36:48] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:46] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for config-master.wikimedia.org - https://phabricator.wikimedia.org/T132459#2199185 (10BBlack) [15:42:54] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2199202 (10BBlack) [15:43:24] (03PS2) 10Gehel: remove wdqs1002 from varnish during reinstall / fix [puppet] - 10https://gerrit.wikimedia.org/r/282946 (https://phabricator.wikimedia.org/T132387) [15:43:39] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2199215 (10BBlack) [15:43:50] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for parsoid-tests.wikimedia.org - https://phabricator.wikimedia.org/T132462#2199229 (10BBlack) [15:43:52] (03CR) 10Filippo Giunchedi: "LGTM overall, some comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [15:44:02] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2199245 (10BBlack) [15:44:15] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for transparency.wikimedia.org - https://phabricator.wikimedia.org/T132464#2199258 (10BBlack) [15:44:30] 06Operations, 10Traffic, 07HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2199271 (10BBlack) [15:45:27] (03CR) 10Gehel: [C: 032] remove wdqs1002 from varnish during reinstall / fix [puppet] - 10https://gerrit.wikimedia.org/r/282946 (https://phabricator.wikimedia.org/T132387) (owner: 10Gehel) [15:45:32] !log disabling wdqs1002 on the 
varnish cache::misc cluster via puppet [15:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:53] (03PS2) 10BBlack: Disable ulsfo T128424 [dns] - 10https://gerrit.wikimedia.org/r/282943 [15:47:23] (03CR) 10BBlack: [C: 032] Disable ulsfo T128424 [dns] - 10https://gerrit.wikimedia.org/r/282943 (owner: 10BBlack) [15:47:45] !log Draining traffic from ulsfo via GeoDNS updates for T128424 maintenance [15:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:54] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2199292 (10Dzahn) @CCogdill_WMF @bblack great, we can merge it now then it looks [15:53:06] (03PS7) 10Dzahn: delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) [15:53:26] (03CR) 10Dzahn: [C: 032] delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [15:53:47] (03CR) 10BBlack: [C: 032] delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [15:54:31] <_joe_> coreyfloyd: around? I wanted to deploy the apache change whenever you're here [15:55:13] !log removed all email.donate. 
from DNS [15:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:42] <_joe_> !log disabling puppet on all mw hosts for apache change [15:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:51] (03PS4) 10Giuseppe Lavagetto: Send wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [15:57:35] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2199348 (10Dzahn) @CCogdill_WMF We have merged the config change on the DNS servers. Changes should... [15:58:04] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07Blocked-on-Fundraising-Tech, 07HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2199353 (10BBlack) [15:58:08] 06Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2199352 (10BBlack) [15:58:12] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) 
from DNS - https://phabricator.wikimedia.org/T130414#2199351 (10BBlack) 05Open>03Resolved [15:58:18] 06Operations, 06Community-Advocacy, 10Traffic, 13Patch-For-Review: Fix/decom multiple-subdomain wikis in wikimedia.org - https://phabricator.wikimedia.org/T102826#2199358 (10BBlack) [15:58:20] 06Operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#2199359 (10BBlack) [15:58:23] 06Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1374642 (10BBlack) 05Open>03Resolved a:03BBlack They're gone! [15:59:36] <_joe_> jouncebot: next? [16:00:04] godog moritzm coreyfloyd: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160412T1600). Please do the needful. [16:00:04] coreyfloyd twentyafterfour: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:18] (03PS2) 10BBlack: Normalize REST API Accept headers [puppet] - 10https://gerrit.wikimedia.org/r/281042 (https://phabricator.wikimedia.org/T128040) (owner: 10GWicke) [16:00:19] <_joe_> coreyfloyd, twentyafterfour ping? [16:00:42] <_joe_> Krenair: I see your patch has been scheduled by coreyfloyd, should I wait for him? 
[16:00:47] (03CR) 10BBlack: [C: 032 V: 032] Normalize REST API Accept headers [puppet] - 10https://gerrit.wikimedia.org/r/281042 (https://phabricator.wikimedia.org/T128040) (owner: 10GWicke) [16:01:02] _joe_, probably a good idea [16:01:29] <_joe_> Krenair: ack, if none of them shows up, I'll just abort the swat :) [16:01:36] <_joe_> It's on hold for now I'd say [16:03:41] * bd808 pokes coreyfloyd in other channels he is know to idle in [16:05:40] it's on his calendar and everything [16:06:10] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2199383 (10scfc) The general issue with exported resources is that they require only friendlies on the same `puppetmaster`. In Labs (almost) anyone can be `root... [16:06:12] (03CR) 10Andrew Bogott: Use half-baked ldap auth for librenms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [16:06:18] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 2 failures [16:06:48] _joe_: I can check twentyafterfour 's patch for him. [16:07:05] <_joe_> thcipriani: so I wanted it to have a marginally better commit message [16:07:09] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 2 failures [16:07:19] <_joe_> I understand that's needed to be able to deploy phabricator via scap3 [16:07:28] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 2 failures [16:07:28] <_joe_> but maybe stating it a bit better? 
[16:07:32] (03PS4) 10Andrew Bogott: Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [16:07:50] <_joe_> and while we're at it, respect our commit message format [16:07:53] <_joe_> :) [16:07:57] <_joe_> which would be [16:08:15] andrewbogott: :) fwiw -> https://gerrit.wikimedia.org/r/#/c/229299/ [16:08:19] <_joe_> scap: add configuration for phabricator [16:08:33] wfm :) [16:08:40] bd808 [16:08:42] thanks [16:08:51] _joe_: I’m here sorry distracted [16:08:55] <_joe_> coreyfloyd: oh here you are, saved by the bell :P [16:08:57] mutante: yeah, my patch has a reference to your old patch + an apology :) [16:09:10] _joe_: i need to stop writing emails [16:09:12] <_joe_> thcipriani: corey has precedence as his change can be potentially harmful [16:09:15] andrewbogott: did not see that yet, thanks for working on it !:) [16:09:23] <_joe_> coreyfloyd: I stopped 3 years ago [16:09:29] lol [16:09:32] <_joe_> thcipriani: can you amend the change? [16:09:38] _joe_: ack. Doing. [16:09:46] <_joe_> thanks <3 [16:09:50] (03PS4) 10Thcipriani: scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (owner: 1020after4) [16:09:51] mutante: the docs are terrible and the code has no error handling, so I had to hack on the librenms source a bit to figure out what the heck they wanted. It does sort of work though. [16:09:52] (03PS1) 10BBlack: Bugfix for REST API Accept normalization: bad string [puppet] - 10https://gerrit.wikimedia.org/r/282957 [16:09:54] (03PS1) 10Yuvipanda: tools: Do not install fonts on precise instances [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) [16:10:04] <_joe_> bblack: I am merging patches for puppet-swat [16:10:08] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 2 failures [16:10:17] <_joe_> can you wait before merging? or is it like urgent? 
[16:10:24] (03PS5) 10Giuseppe Lavagetto: Send wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [16:10:34] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Send wikipedia.org/apple-app-site-association to the right file without an external redirect [puppet] - 10https://gerrit.wikimedia.org/r/281900 (https://phabricator.wikimedia.org/T131250) (owner: 10Alex Monk) [16:10:37] (03CR) 10BBlack: [C: 032 V: 032] Bugfix for REST API Accept normalization: bad string [puppet] - 10https://gerrit.wikimedia.org/r/282957 (owner: 10BBlack) [16:10:41] andrewbogott: *nod* all i head was that blog post that was linked there. very cool that you got it to work. there was apparently also a big difference between opendj and openldap now [16:10:44] <_joe_> beat you :P [16:10:45] had [16:10:49] sorry, not even reading here [16:10:55] <_joe_> heh [16:11:09] I think you merged mine and I merged yours, or something [16:11:24] <_joe_> did you puppet-merge already? 
[16:11:28] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [16:11:37] yes [16:11:46] <_joe_> ok so you merged mine [16:11:51] I don't know how it's possible that mine got merged unless you did though [16:11:57] <_joe_> coreyfloyd: I'm gonna apply the change to mw1017 [16:12:02] mine seems to be after yours, and I never got a message about mine [16:12:16] _joe_: ok - thanks [16:12:18] <_joe_> yours is not merged I think [16:12:25] no it definitely is [16:12:30] <_joe_> https://gerrit.wikimedia.org/r/#/c/282957/ [16:12:37] (03CR) 10Dzahn: [C: 031] Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [16:12:39] <_joe_> Can Merge "no" [16:12:44] ugh [16:12:53] (03PS2) 10BBlack: Bugfix for REST API Accept normalization: bad string [puppet] - 10https://gerrit.wikimedia.org/r/282957 [16:13:00] (03CR) 10BBlack: [V: 032] Bugfix for REST API Accept normalization: bad string [puppet] - 10https://gerrit.wikimedia.org/r/282957 (owner: 10BBlack) [16:13:31] <_joe_> coreyfloyd: so the expected outcome is that the file gets served directly and not via a series of redirects, right? [16:13:40] _joe_: yep [16:13:53] _joe_: that redirect is hated by Apple’s universal link tech [16:13:57] <_joe_> coreyfloyd: do you have the wikimedia-debug chrome extension? [16:14:01] (03Abandoned) 10Dzahn: librenms - enable LDAP auth (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: 10Dzahn) [16:14:10] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 2 failures [16:14:10] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2199455 (10NahidSultan) another one: https://upload.wikimedia.org/wikipedia/commons/6/60/Sajid-kiptus.webm htt... 
[16:14:18] _joe_: no - let me find and install [16:14:53] <_joe_> coreyfloyd: that allows your requests from the browser to be served directly by one of our test hosts [16:15:19] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 2 failures [16:15:38] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 2 failures [16:15:50] (03CR) 10BBlack: [C: 031] Port varnishxcps to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/282887 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [16:15:55] _joe_: do you have a link for that extension… can’t seem to find it using search [16:16:07] <_joe_> X-Wikimedia-Debug [16:16:18] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 2 failures [16:16:36] <_joe_> coreyfloyd: anyways, it seems to work correctly https://dpaste.de/E9QF/raw [16:16:37] _joe_: also would help if i opened chrome instead of safari… [16:16:45] <_joe_> coreyfloyd: ahahah apple maniac :P [16:16:58] _joe_: ok - thanks - i will check - i also have a way to verify using the device [16:17:03] _joe_: you have no idea [16:17:21] <_joe_> coreyfloyd: it's deployed just to one server, not to all of them [16:17:26] <_joe_> but it works as expected [16:17:35] <_joe_> so I think I can just spread the change now [16:17:38] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:17:39] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:17:54] _joe_: ok cool - i’ll test as on device as soon as you do [16:18:04] <_joe_> !log reenabling puppet everywhere [16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:11] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2199461 (10Milimetric) p:05Triage>03Normal [16:18:17] PROBLEM - puppet last run on 
cp2023 is CRITICAL: CRITICAL: Puppet has 2 failures [16:18:17] 06Operations, 10ops-eqiad, 06Analytics-Kanban: Analytics1039 host showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2199462 (10Milimetric) p:05Triage>03Normal [16:18:26] (03PS5) 10Thcipriani: scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:19:21] <_joe_> coreyfloyd: it will take some time to spread the change, I'll ping you when it's done [16:19:34] _joe_: thanks [16:20:38] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 2 failures [16:20:58] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 2 failures [16:21:08] (03PS6) 10Giuseppe Lavagetto: scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:21:09] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures [16:21:14] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2199473 (10Krenair) AFAIK everyone with access to deployment-prep has root on deployment-puppetmaster, which is the puppetmaster for all instances in the project... [16:21:18] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 2 failures [16:21:19] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 2 failures [16:21:38] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 2 failures [16:21:44] <_joe_> thcipriani: LGTM, but I'd like to wait for the apache change to spread [16:21:57] anomie: can you confirm that https://phabricator.wikimedia.org/T126262 is resolved? 
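Note on the apple-app-site-association check discussed above: the expected outcome was that the file is served directly rather than through redirects. A minimal sketch of such a check, assuming nothing about the real hostnames or test hosts, is to issue a single GET without following redirects and look at the status and `Location` header. The helper names below are hypothetical, and this is not the method used in the log (there the check was done with a browser and on a device).

```python
import http.client
from urllib.parse import urlsplit


def fetch_once(url):
    """Issue one GET without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def served_directly(url):
    """True if the URL answers 200 on the first request (no redirect hop)."""
    status, _location = fetch_once(url)
    return status == 200
```

For example, `served_directly("https://wikipedia.org/apple-app-site-association")` would be the kind of call to try; the URL is taken from the patch title, and what it returns depends on the live configuration.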
[16:22:20] _joe_: np [16:22:27] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 2 failures [16:22:28] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Puppet has 2 failures [16:22:38] * andrewbogott looking at those puppet errors [16:22:58] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 2 failures [16:22:58] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:23:17] andrewbogott: Seems to be [16:23:18] stupid strontium fetch fails :P [16:23:29] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 2 failures [16:23:42] andrewbogott: Wait. Not on tin. [16:24:01] (03PS1) 10EBernhardson: cirrus: Force beta cluster to always use eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282961 (https://phabricator.wikimedia.org/T132408) [16:24:07] bblack: 'stupid strontium fetch fails' is in regards to those puppet errors? [16:24:20] Ah, sure enough, it's intermittent [16:24:25] well, in regards to the second round of them, when I thought salt was fixing them all [16:24:38] salting again! 
[16:24:58] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 2 failures [16:25:07] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:25:08] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:17] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 2 failures [16:25:19] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 2 failures [16:25:38] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures [16:25:59] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:26:08] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:26:18] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:26:18] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:26:39] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:26:55] (03CR) 10Merlijn van Deen: [C: 04-1] "This brings current and new (built in the future) exec nodes in an inconsistent state -- I think we should either actively remove the pack" [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda) [16:26:57] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:00] (03CR) 10DCausse: [C: 031] cirrus: Force beta cluster to always use eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282961 (https://phabricator.wikimedia.org/T132408) (owner: 10EBernhardson) [16:27:08] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [16:27:11] (03PS1) 10Dzahn: rsync home directories for stat1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/282963 (https://phabricator.wikimedia.org/T76348) [16:27:19] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:22] (03CR) 10EBernhardson: [C: 032] cirrus: Force beta cluster to always use eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282961 (https://phabricator.wikimedia.org/T132408) (owner: 10EBernhardson) [16:27:33] (03PS2) 10Dzahn: rsync home directories for stat1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/282963 (https://phabricator.wikimedia.org/T76348) [16:27:37] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:07] (03Merged) 10jenkins-bot: cirrus: Force beta cluster to always use eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282961 (https://phabricator.wikimedia.org/T132408) (owner: 10EBernhardson) [16:28:08] mutante: we might also need to save more stuff from stat1001, milimetric is going to update the phab task [16:28:08] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 2 failures [16:28:08] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:09] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2199484 (10BBlack) [16:28:39] (03CR) 10Yuvipanda: "Right, so I'd like to not do the latter (maintain our own copy of fonts packages)." 
[puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda) [16:28:42] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2199488 (10Milimetric) Ok, there are two places I know have important files on stat1001. 1. The files served at datasets.wikimedia.org/ These are stored in three places, one... [16:28:55] elukey: ok, good. we can add the additional pathes needed [16:29:09] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-labs.php: No-op sync of beta cluster config change for T132408 (duration: 00m 26s) [16:29:10] T132408: [Regression pre-wmf.21] Search for pages/templates in VE is not working in Beta cluster - https://phabricator.wikimedia.org/T132408 [16:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:24] 06Operations, 10Traffic, 07Easy, 07HTTPS: WMF-Last-Access cookies doesn't set Secure flag - https://phabricator.wikimedia.org/T105451#2199492 (10BBlack) [16:29:26] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2199493 (10BBlack) [16:29:27] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:29:51] (03PS3) 10Dzahn: rsync home directories for stat1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/282963 (https://phabricator.wikimedia.org/T76348) [16:30:21] (03PS4) 10Dzahn: rsync home directories for stat1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/282963 (https://phabricator.wikimedia.org/T76348) [16:30:38] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:30:48] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:18] RECOVERY - puppet last run on 
cp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:18] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:38] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:47] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:53] (03CR) 10Dzahn: [C: 032] "includes ferm rule, we have enough space on fluorine, additional pathes will be added" [puppet] - 10https://gerrit.wikimedia.org/r/282963 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [16:31:55] anomie: so, we have 'deployment' hosts and 'maintenance' hosts and I'm not sure I understand the difference. I can just explicitly add that package to deployment hosts as well, but maybe deployment hosts should also be maintenance hosts? [16:32:36] andrewbogott: no, they are 2 separate roles. maintenance hosts run a bunch of cron jobs that run maintenance scripts [16:32:38] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:32:51] like all the things Mediawiki wants to run regularly in the background [16:32:58] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:32:58] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:33:00] andrewbogott: I have no idea. Now that we have mwrepl and I learned how to use the 'sql' script, I haven't run into the bug much lately. 
[16:33:20] (03PS1) 10Ema: Automatically restart varnish statistics daemons [puppet] - 10https://gerrit.wikimedia.org/r/282965 (https://phabricator.wikimedia.org/T132430) [16:33:55] deployment hosts are not (necessarily) related to that [16:33:59] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:34:08] (03PS1) 10Gehel: Increase max size for /var/lib/wdqs to take all remaining space [puppet] - 10https://gerrit.wikimedia.org/r/282966 (https://phabricator.wikimedia.org/T120714) [16:34:45] (03CR) 10RobH: [C: 031] Increase max size for /var/lib/wdqs to take all remaining space [puppet] - 10https://gerrit.wikimedia.org/r/282966 (https://phabricator.wikimedia.org/T120714) (owner: 10Gehel) [16:35:38] (03PS2) 10Gehel: Increase max size for /var/lib/wdqs to take all remaining space [puppet] - 10https://gerrit.wikimedia.org/r/282966 (https://phabricator.wikimedia.org/T120714) [16:36:20] (03CR) 10BBlack: "Maybe add RestartSec= at some value like "1" or even "5" as well? 
Otherwise if it's persistently crashing, systemd will spam restarts and " [puppet] - 10https://gerrit.wikimedia.org/r/282965 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema) [16:37:28] <_joe_> thcipriani: since this doesn't seem to finish for now, I'll go on with your patch [16:37:54] _joe_: kk [16:38:09] (03PS7) 10Giuseppe Lavagetto: scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:38:12] (03CR) 10Gehel: [C: 032] Increase max size for /var/lib/wdqs to take all remaining space [puppet] - 10https://gerrit.wikimedia.org/r/282966 (https://phabricator.wikimedia.org/T120714) (owner: 10Gehel) [16:38:16] (03PS1) 10Dzahn: don't use fluorine for rsyncd, conflicts [puppet] - 10https://gerrit.wikimedia.org/r/282967 (https://phabricator.wikimedia.org/T76348) [16:38:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:38:25] should just need a run on the two deployment machines, will create a new group + checkout the phab repo from the looks of it. [16:38:26] (03PS8) 10Giuseppe Lavagetto: scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:38:29] <_joe_> oh wtf guys [16:38:34] (03PS1) 10Andrew Bogott: Include php5-readline package on deployment servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) [16:38:37] <_joe_> everyone merging during the puppetwat window [16:38:43] (03PS2) 10Dzahn: don't use fluorine for rsyncd, conflicts [puppet] - 10https://gerrit.wikimedia.org/r/282967 (https://phabricator.wikimedia.org/T76348) [16:38:51] (03CR) 10Giuseppe Lavagetto: [V: 032] scap: add configuration for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/282186 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:39:20] <_joe_> gehel: I'm merging your patch too [16:39:23] (03CR) 10Dzahn: [C: 032] "sorry, one more merge because i broke something. will wait then @joe" [puppet] - 10https://gerrit.wikimedia.org/r/282967 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [16:39:34] _joe_: thanks! [16:39:47] (03PS3) 10Dzahn: don't use fluorine for rsyncd, conflicts [puppet] - 10https://gerrit.wikimedia.org/r/282967 (https://phabricator.wikimedia.org/T76348) [16:40:05] (03CR) 10Dzahn: [V: 032] don't use fluorine for rsyncd, conflicts [puppet] - 10https://gerrit.wikimedia.org/r/282967 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [16:40:15] (03PS1) 10Giuseppe Lavagetto: Revert "scap: add configuration for phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/282969 [16:40:34] (03PS2) 10Giuseppe Lavagetto: Revert "scap: add configuration for phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/282969 [16:40:37] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: puppet fail [16:40:37] * mutante definitely waits now [16:40:39] <_joe_> seriously, stop it [16:40:46] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "scap: add configuration for phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/282969 (owner: 10Giuseppe Lavagetto) [16:41:12] (03PS2) 10Andrew Bogott: Include php5-readline package on deployment servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) [16:41:32] (03CR) 10Giuseppe Lavagetto: "this broke puppet on tin," [puppet] - 10https://gerrit.wikimedia.org/r/282969 (owner: 10Giuseppe Lavagetto) [16:41:48] _joe_: it is like trying to get to Rome's city centre in the morning [16:42:15] (03PS2) 10Ema: Automatically restart varnish statistics daemons [puppet] - 10https://gerrit.wikimedia.org/r/282965 (https://phabricator.wikimedia.org/T132430) [16:42:27] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2199597 (10He7d3r) [16:42:28] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:42:37] _joe_: ugh. sorry. [16:42:53] <_joe_> thcipriani: np, I need to take a better look at it [16:43:56] <_joe_> coreyfloyd: the change is applied and works correctly; but the redirect is currently cached into varnish [16:44:09] <_joe_> honestly, I'd prefer not to un-cache it unless it's urgent [16:44:22] (03CR) 10Merlijn van Deen: "> (maintain our own copy of fonts packages)." [puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda) [16:44:39] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [16:44:43] uhm... [16:44:57] (03CR) 10Yuvipanda: "hmm, fair enough. I'll amend with a note." 
[puppet] - 10https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) (owner: 10Yuvipanda) [16:46:31] (03CR) 10BBlack: [C: 031] Automatically restart varnish statistics daemons [puppet] - 10https://gerrit.wikimedia.org/r/282965 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema) [16:46:34] <_joe_> twentyafterfour: you can re-schedule the change once amended for Thursday [16:46:49] _joe_: I don't know how to amend it [16:46:59] it shouldn't break, worked on beta [16:47:02] * twentyafterfour is stumped [16:47:44] <_joe_> twentyafterfour: beta doesn't use the admin module [16:47:46] Can someone nuke /tmp/make-wmf-branch/ on tin for me? [16:47:46] obviously I'm missing something [16:47:58] <_joe_> twentyafterfour: the puppet compiler would've told you [16:48:02] (03PS3) 10Ema: Automatically restart varnish statistics daemons [puppet] - 10https://gerrit.wikimedia.org/r/282965 (https://phabricator.wikimedia.org/T132430) [16:48:03] <_joe_> ostriches: I am already on tin [16:48:07] <_joe_> ostriches: the whole dir? [16:48:10] (03CR) 10Ema: [C: 032 V: 032] Automatically restart varnish statistics daemons [puppet] - 10https://gerrit.wikimedia.org/r/282965 (https://phabricator.wikimedia.org/T132430) (owner: 10Ema) [16:48:10] _joe_ I ran it through puppet compiler [16:48:29] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:37] (03CR) 10Anomie: Include php5-readline package on deployment servers. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) (owner: 10Andrew Bogott) [16:48:40] <_joe_> ostriches: done [16:48:51] (03PS2) 10Ema: Port varnishxcps to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/282887 (https://phabricator.wikimedia.org/T131353) [16:48:52] _joe_: tyvm [16:49:01] (03CR) 10Ema: [C: 032 V: 032] Port varnishxcps to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/282887 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [16:49:16] I didn't do anything different with the admin module, afaik... just simply adding a group [16:49:30] * twentyafterfour shrugs [16:50:06] <_joe_> coreyfloyd: so, if you ask for e.g. https://www.wikipedia.org/apple-app-site-association?nocache [16:50:10] <_joe_> it works as expected [16:50:16] <_joe_> the simple base url doesn't [16:50:20] <_joe_> because it's cached [16:53:42] there's no Cache-Control on the old redirect [16:54:10] so it would've defaulted to 3 days [16:54:14] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2199628 (10Dzahn) >>! In T76348#2198981, @elukey wrote: > @Dzahn another solution (more a hack) is to open a port / disable ferm for a bit (/me hides from @MoritzMuehlenhoff) and... [16:54:36] (03PS3) 10Andrew Bogott: Include php5-readline package on deployment servers. [puppet] - 10https://gerrit.wikimedia.org/r/282968 (https://phabricator.wikimedia.org/T126262) [16:57:45] 06Operations, 10MediaWiki-General-or-Unknown, 07Graphite: mediawiki statsd traffic - https://phabricator.wikimedia.org/T132472#2199631 (10fgiunchedi) [16:57:56] <_joe_> bblack: ugh, so we might need to un-cache those [16:58:56] does it even matter? [16:59:11] I mean in terms of getting it done quickly [16:59:20] <_joe_> what do you mean? [16:59:37] I mean the state we were in before was fine for a long time. 
is it now critical this changes for all clients sooner than 3 days? [16:59:57] <_joe_> I don't know, let's see if coreyfloyd says something :) [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160412T1700). [17:00:15] ? [17:00:18] no parsoid deploy [17:00:28] nope [17:00:28] sorry, but I'm still fighting the meta-fight of "one-off cache bans should not be routine. please plan accordingly" [17:01:08] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2199650 (10Dzahn) >>! In T76348#2199488, @Milimetric wrote: > 1. The files served at datasets.wikimedia.org/ These are stored in three places, one is in /var/www and the other... [17:01:13] <_joe_> coreyfloyd: read the backlog [17:01:15] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199651 (10Ladsgroup) @thcipriani I was dealing with this issue while I was working on using scap I thought I solved it. That's what I've got so far: 70 me... [17:01:22] _joe_: reading… [17:01:33] <_joe_> the change is applied successfully, but it will take up to 3 days to get flushed form the varnish caches [17:01:46] _joe_: thats ok - it isn’t urgent [17:01:48] 07Puppet, 10Beta-Cluster-Infrastructure, 03Scap3: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199659 (10Ladsgroup) I'll be around in IRC if you want to discuss or ask question. 
[17:01:49] <_joe_> if it's not urgent/something completely broken, I'd wait [17:01:51] <_joe_> ok cool [17:01:54] <_joe_> then we're done [17:01:55] <_joe_> :) [17:01:59] <_joe_> thanks [17:02:00] _joe_: cool - thanks again [17:02:04] 07Puppet, 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service, 03Scap3: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199660 (10Ladsgroup) [17:02:16] !log reinstall of wdqs1002 [17:02:21] _joe_: I’ll let you know when I can test the changes on the device in a few days [17:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:38] <_joe_> gehel: what's up with wdqs1002 then? [17:02:53] <_joe_> still can't boot? [17:02:57] _joe_: unfortunately I can’t force Apple to pass a nocache parameter… [17:02:58] RECOVERY - Host wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [17:03:06] <_joe_> coreyfloyd: I know :/ [17:03:11] Stupid Apple [17:03:17] we needed to rebuild the raid to include new disks anyway, so we just went for the full reinstall [17:06:43] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 13Patch-For-Review: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2199666 (10Gehel) Full reinstall is in progress as we needed to account for new disks anyway. 
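Note on the caching exchange above (the old redirect carried no Cache-Control header, so it fell back to a 3-day default, while the ?nocache URL was a different cache key and worked immediately): a minimal sketch of that fallback logic, assuming a cache that prefers s-maxage, then max-age, and otherwise applies a default TTL. The function name and the simplified header parsing are illustrative assumptions, not the actual Varnish configuration discussed here; only the 3-day figure comes from the conversation.

```python
import re

# Default TTL mentioned in the conversation: 3 days, expressed in seconds.
DEFAULT_TTL_S = 3 * 24 * 3600


def effective_ttl(headers, default_ttl=DEFAULT_TTL_S):
    """Return the TTL (seconds) a shared cache would apply to a response.

    Hypothetical helper: looks for s-maxage first, then max-age, in the
    Cache-Control header; with neither present, the default TTL applies.
    Expects the header under the exact key "Cache-Control".
    """
    cache_control = headers.get("Cache-Control", "")
    for directive in ("s-maxage", "max-age"):
        match = re.search(
            r"(?:^|[,\s])" + directive + r"\s*=\s*(\d+)",
            cache_control,
            re.IGNORECASE,
        )
        if match:
            return int(match.group(1))
    return default_ttl


if __name__ == "__main__":
    # A redirect with no Cache-Control header falls back to the default.
    print(effective_ttl({}))
    # An explicit max-age overrides the default.
    print(effective_ttl({"Cache-Control": "public, max-age=60"}))
```

Under these assumptions, sending an explicit Cache-Control header on such redirects would bound how long a later change takes to reach clients, without needing a one-off cache ban.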
[17:07:19] PROBLEM - dhclient process on wdqs1002 is CRITICAL: Connection refused by host [17:07:19] PROBLEM - RAID on wdqs1002 is CRITICAL: Connection refused by host [17:07:39] PROBLEM - Blazegraph Port on wdqs1002 is CRITICAL: Connection refused by host [17:07:40] PROBLEM - Updater process on wdqs1002 is CRITICAL: Connection refused by host [17:07:58] PROBLEM - configured eth on wdqs1002 is CRITICAL: Connection refused by host [17:07:59] PROBLEM - Blazegraph process on wdqs1002 is CRITICAL: Connection refused by host [17:08:00] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: Connection refused [17:08:07] gehel: you probably need to tell icinga to shut up :) [17:08:18] PROBLEM - WDQS HTTP Port on wdqs1002 is CRITICAL: Connection refused by host [17:08:18] PROBLEM - Disk space on wdqs1002 is CRITICAL: Connection refused by host [17:08:37] PROBLEM - Check size of conntrack table on wdqs1002 is CRITICAL: Connection refused by host [17:08:37] PROBLEM - puppet last run on wdqs1002 is CRITICAL: Connection refused by host [17:08:38] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: Connection refused [17:08:48] PROBLEM - salt-minion processes on wdqs1002 is CRITICAL: Connection refused by host [17:08:48] PROBLEM - DPKG on wdqs1002 is CRITICAL: Connection refused by host [17:09:45] damn, sorry for the noise... [17:11:00] gehel: done, like this, fwiw [17:11:22] @neon:/usr/local/bin# icinga-downtime -h wdqs1002 -d 2 -r reinstall [17:11:54] mutante: Oh, that's nice... I was still clicking in the web interface... [17:12:47] gehel: sometimes this, sometimes that seems quicker :) [17:13:10] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2199673 (10matmarex) @bblack pointed out that if the upload.wikimedia.org URL is still accessible after deletin... [17:13:53] mutante: looking into icinga via CLI was actually in my todo list... 
[17:14:19] eh, -d 7200 , it's seconds :) [17:14:50] the script writes into the commandfile="/var/lib/nagios/rw/nagios.cmd" [17:15:19] _joe_, doesn't varnish cache up to 30 days? [17:15:26] which is a fifo [17:15:49] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: Decommission plutonium - https://phabricator.wikimedia.org/T118586#2199688 (10Cmjohnson) 05Open>03Resolved removed from rack added to decommissioned list resolving this task [17:16:41] mutante: monty just thanked wikimedia for starting to use mariadb, that was the start of its popularity [17:18:27] jzerebecki: cool! afair binasher actually started it [17:20:11] jzerebecki: Neat. [17:21:59] 07Puppet, 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service, 03Scap3: Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2199720 (10Krenair) [17:22:06] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2199721 (10ema) [17:22:48] 06Operations, 10Analytics, 10Traffic, 07Varnish: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2199749 (10ema) p:05Triage>03Normal [17:23:44] <_joe_> Krenair: 3 days is what should happen in this case, according to bblack [17:25:37] 06Operations, 10ops-eqiad, 06DC-Ops, 10hardware-requests: Decommission rubidium - https://phabricator.wikimedia.org/T118213#2199751 (10Cmjohnson) 05Open>03Resolved removed from rack, racktables updated. [17:26:15] 06Operations, 13Patch-For-Review: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2199758 (10Cmjohnson) 05Open>03Resolved removed from rack, removed from racktables. 
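Note on the icinga-downtime exchange above (-d takes seconds, and the script writes into the command file /var/lib/nagios/rw/nagios.cmd, which is a FIFO): a rough sketch of what scheduling a fixed host downtime through that command file can look like, using the standard Nagios/Icinga 1.x SCHEDULE_HOST_DOWNTIME external command. This is not the icinga-downtime script itself; the function names and the "ops" author value are hypothetical, and only the command-file path and the 7200-second duration come from the conversation.

```python
import time

# Command file path as quoted in the conversation; it is a named pipe (FIFO).
COMMAND_FILE = "/var/lib/nagios/rw/nagios.cmd"


def format_host_downtime(host, duration_s, author, comment, now=None):
    """Build one SCHEDULE_HOST_DOWNTIME external command line.

    Field order follows the Nagios external command format:
    host;start;end;fixed;trigger_id;duration;author;comment
    """
    start = int(time.time()) if now is None else int(now)
    end = start + int(duration_s)
    # fixed=1 means the downtime runs exactly from start to end;
    # trigger_id=0 means it is not triggered by another downtime.
    return "[{0}] SCHEDULE_HOST_DOWNTIME;{1};{2};{3};1;0;{4};{5};{6}\n".format(
        start, host, start, end, int(duration_s), author, comment
    )


def submit(command_line, command_file=COMMAND_FILE):
    """Write the command to the FIFO; blocks until the daemon reads it."""
    with open(command_file, "w") as fifo:
        fifo.write(command_line)


if __name__ == "__main__":
    # 7200 seconds = 2 hours, matching the corrected "-d 7200" above.
    print(format_host_downtime("wdqs1002", 7200, "ops", "reinstall"), end="")
```

This only covers the host itself; silencing the individual service checks (as seen in the alert flood for wdqs1002) would need the corresponding service-level command as well.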
[17:27:26] _joe_, interesting, ok [17:28:00] (03PS1) 10Dzahn: stat1001: setup to rsync /srv/ and /var/www [puppet] - 10https://gerrit.wikimedia.org/r/282974 (https://phabricator.wikimedia.org/T76348) [17:29:20] (03CR) 10jenkins-bot: [V: 04-1] stat1001: setup to rsync /srv/ and /var/www [puppet] - 10https://gerrit.wikimedia.org/r/282974 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [17:32:37] (03PS2) 10Dzahn: stat1001: setup to rsync /srv/ and /var/www [puppet] - 10https://gerrit.wikimedia.org/r/282974 (https://phabricator.wikimedia.org/T76348) [17:37:00] 06Operations, 10ops-eqiad, 06DC-Ops: connect an external harddisk with >2TB space to stat1001 - https://phabricator.wikimedia.org/T132476#2199771 (10Dzahn) [17:37:51] (03PS3) 10Dzahn: stat1001: setup to rsync /srv/ and /var/www [puppet] - 10https://gerrit.wikimedia.org/r/282974 (https://phabricator.wikimedia.org/T76348) [17:39:24] (03CR) 10Dzahn: [C: 032] stat1001: setup to rsync /srv/ and /var/www [puppet] - 10https://gerrit.wikimedia.org/r/282974 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [17:39:56] (03PS1) 10Ema: varnishmedia: ignore transaction without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) [17:42:00] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2199791 (10ori) Coloring in some additional details. I noticed a regression in first paint time on desktop over the past three months and found a correlated slump in the percent of cli... 
[17:42:04] bblack, ema ^ [17:43:52] (03PS2) 10Dzahn: Don't use package-> latest for apt-transport-https [puppet] - 10https://gerrit.wikimedia.org/r/282941 (owner: 10Muehlenhoff) [17:44:41] (03CR) 10Dzahn: [C: 031] Don't use package-> latest for apt-transport-https [puppet] - 10https://gerrit.wikimedia.org/r/282941 (owner: 10Muehlenhoff) [17:46:06] (03CR) 10Dzahn: "@Paladox i don't know phabricator config details, 20after4 is the authority on this" [puppet] - 10https://gerrit.wikimedia.org/r/281071 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [17:46:29] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:37] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:46:37] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:38] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:46] (03CR) 10Dzahn: "i know nothing about Differential" [puppet] - 10https://gerrit.wikimedia.org/r/281069 (https://phabricator.wikimedia.org/T131623) (owner: 10Paladox) [17:46:47] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:49] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:46:58] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:58] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:47:07] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:08] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:17] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:17] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:17] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:28] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:37] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:48] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:48] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:48] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:47:59] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:48:08] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:48:08] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:57:07] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [17:57:08] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [17:57:08] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [17:57:18] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [17:57:27] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [17:57:28] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [17:57:35] !log ssh alsafi fixes ganeti VM timeouts once again [17:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:48] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:57:48] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [17:57:49] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [17:57:58] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [17:57:58] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [17:58:10] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [17:58:17] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [17:58:17] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [17:58:18] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:58:19] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [17:58:28] RECOVERY - DPKG on alsafi is OK: All packages OK [17:58:28] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [17:58:28] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [17:58:32] 06Operations, 
10ops-eqiad, 10hardware-requests: connect an external harddisk with >2TB space to stat1001 - https://phabricator.wikimedia.org/T132476#2199864 (10Dzahn) [17:58:38] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [17:58:47] RECOVERY - Disk space on alsafi is OK: DISK OK [17:58:48] (03PS1) 10Muehlenhoff: Enable base::firewall for all rdb* servers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/282979 [17:58:50] (03PS1) 10Muehlenhoff: Enable base::firewall for rdb* servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/282980 [17:59:46] 06Operations, 06Release-Engineering-Team, 10Traffic, 05Gitblit-Deprecate, 07HTTPS: HTTPS redirects for git.wikimedia.org - https://phabricator.wikimedia.org/T132460#2199866 (10Dzahn) [18:00:19] 06Operations, 10Parsoid, 10Traffic, 07HTTPS: HTTPS redirects for parsoid-tests.wikimedia.org - https://phabricator.wikimedia.org/T132462#2199867 (10Dzahn) [18:00:56] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2199873 (10Dzahn) [18:00:59] (03PS1) 10Chad: Group0 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282982 [18:01:53] 06Operations, 10Analytics-Cluster, 10Traffic, 07HTTPS: HTTPS redirects for stats.wikimedia.org - https://phabricator.wikimedia.org/T132465#2199877 (10Dzahn) [18:01:56] 06Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2199879 (10ema) p:05Triage>03Normal [18:02:15] 06Operations: Weak digest algorithm (SHA1) used to sign InRelease on apt.wikimedia.org - https://phabricator.wikimedia.org/T132325#2199880 (10ema) p:05Triage>03Normal [18:02:38] 06Operations: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2199881 (10ema) p:05Triage>03High [18:02:56] 06Operations: Default gateway 
unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#2199895 (10ema) p:05Triage>03Normal [18:03:02] 06Operations, 10Pybal, 10Traffic, 07HTTPS: HTTPS redirects for config-master.wikimedia.org - https://phabricator.wikimedia.org/T132459#2199185 (10Dzahn) [18:03:09] !log demon@tin Started scap: testwiki to wmf.21 and rebuild l10n [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:00] 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458#2199897 (10ema) p:05Triage>03Normal [18:04:11] (03PS1) 10Chad: Remove expired old staticy things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282983 [18:04:12] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2199898 (10ema) p:05Triage>03Normal [18:07:53] 06Operations, 10Ops-Access-Requests: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2199903 (10Dzahn) For completeness, this has been raised in the ops meeting yesterday as " gilles root on swift machines (perf-roots?)" and the outcome was: Meeting Outcome: if its just f... [18:09:16] (03PS2) 10Ema: varnishmedia: ignore transactions without status code [puppet] - 10https://gerrit.wikimedia.org/r/282976 (https://phabricator.wikimedia.org/T132430) [18:09:45] 06Operations, 10Ops-Access-Requests: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2199908 (10Dzahn) p:05Triage>03High [18:11:20] Could someone nuke /srv/mediawiki-staging/php-1.27.0-wmf.{12,13}/ from tin? Permissions on cache is preventing me. 
[18:14:33] (03CR) 10Dzahn: [C: 031] "looks like it makes sense, yea" [puppet] - 10https://gerrit.wikimedia.org/r/282478 (owner: 10Paladox) [18:14:48] mutante: ^^ [18:14:55] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 2 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2199926 (10DStrine) [18:15:06] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 3 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2135390 (10DStrine) [18:18:27] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 3 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2199931 (10CCogdill_WMF) p:05High>03Unbreak! Hi @Dzahn, I think a record was deleted that... [18:19:20] ostriches: done [18:19:31] ty [18:19:37] !log tin: rm php-1.27.0-wmf.12/13 from mw-staging [18:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:17] 06Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2199938 (10Dzahn) [18:20:20] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07Blocked-on-Fundraising-Tech, 07HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2199939 (10Dzahn) [18:20:24] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 3 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) 
from DNS - https://phabricator.wikimedia.org/T130414#2199936 (10Dzahn) 05Resolved>03Open yes, checking [18:22:32] !log demon@tin scap aborted: testwiki to wmf.21 and rebuild l10n (duration: 19m 23s) [18:22:36] !log demon@tin Started scap: testwiki to wmf.21 and rebuild l10n [18:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:13] (03PS1) 10Dzahn: re-add silverpop domainkey entry for fundraising [dns] - 10https://gerrit.wikimedia.org/r/282984 (https://phabricator.wikimedia.org/T130414) [18:24:02] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2199946 (10Cmjohnson) [18:24:04] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2199944 (10Cmjohnson) 05Open>03Resolved added to spares list...remains in rack [18:24:12] (03PS2) 10Dzahn: re-add silverpop domainkey entry for fundraising [dns] - 10https://gerrit.wikimedia.org/r/282984 (https://phabricator.wikimedia.org/T130414) [18:25:02] 06Operations, 10ops-eqiad, 06DC-Ops: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#2199950 (10Cmjohnson) 05Open>03Resolved well out of warranty. removed from rack and added to the decom list. [18:25:15] (03CR) 10Dzahn: [C: 032] "adding back asap, as requested by Ccogdil" [dns] - 10https://gerrit.wikimedia.org/r/282984 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [18:25:41] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (10Cmjohnson) [18:25:43] 06Operations, 10ops-eqiad, 06DC-Ops: decom caesium - https://phabricator.wikimedia.org/T125165#2199953 (10Cmjohnson) 05Open>03Resolved added to spares page and remains in rack for potential use...will need more approval to decom for good. 
[18:26:35] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 3 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2199957 (10Dzahn) >>! In T130414#2199931, @CCogdill_WMF wrote: > Hi @Dzahn, I think a record w... [18:26:58] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 3 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2199962 (10CCogdill_WMF) Whew, thanks so much for the quick fix! [18:27:53] 06Operations, 10ops-eqiad, 06DC-Ops, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#2199965 (10Cmjohnson) 05Open>03Resolved calcium has been removed from rack and updated to decom list in racktables. [18:29:23] 06Operations, 06Labs, 10procurement: Get an updated quote from HP for virt nodes - https://phabricator.wikimedia.org/T132363#2199968 (10Andrew) [18:29:46] 06Operations, 10Traffic, 10Wikimedia-Fundraising, 07Blocked-on-Fundraising-Tech, 07HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2199972 (10Dzahn) [18:29:49] 06Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2199971 (10Dzahn) [18:29:53] 06Operations, 10Fundraising-Backlog, 10Traffic, 10Wikimedia-Fundraising, and 3 others: delete links.email.donate.wikimedia.org (and all other email.donate.*?) 
from DNS - https://phabricator.wikimedia.org/T130414#2199970 (10Dzahn) 05Open>03Resolved [18:30:12] 06Operations, 06Labs, 10hardware-requests, 10procurement: Get an updated quote from HP for virt nodes - https://phabricator.wikimedia.org/T132363#2195713 (10Andrew) [18:31:25] 06Operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#2199982 (10Cmjohnson) [18:31:27] 06Operations, 10ops-eqiad, 06DC-Ops, 10hardware-requests: Decommission neptunium - https://phabricator.wikimedia.org/T122101#2199980 (10Cmjohnson) 05Open>03Resolved removed from rack and updated racktables [18:31:39] (03CR) 10Dzahn: "should this be amended or replaced by a new change?" [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [18:33:27] 06Operations, 10ops-eqiad, 10hardware-requests: reclaim to spares: cp1056, cp1057, cp1069, cp1070 - https://phabricator.wikimedia.org/T130884#2199990 (10Cmjohnson) [18:33:34] 06Operations, 06Labs, 10hardware-requests: Get an updated quote from HP for virt nodes - https://phabricator.wikimedia.org/T132363#2199991 (10Krenair) [18:33:36] 06Operations, 10ops-eqiad, 10hardware-requests: reclaim to spares: cp1056, cp1057, cp1069, cp1070 - https://phabricator.wikimedia.org/T130884#2149499 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson completed [18:33:52] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2199998 (10BBlack) Making the above analysis harder for anyone else looking: that's not a percentage of client connections using SPDY, it's a percentage of client **requests** using SP... 
[18:34:09] (03CR) 10Dzahn: "adding Rush" [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [18:34:43] (03PS2) 10Mschon: added SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) [18:37:46] !log demon@tin Finished scap: testwiki to wmf.21 and rebuild l10n (duration: 15m 09s) [18:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:52] (03CR) 10Dzahn: "any updates here? i keep forgetting what we said back then" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [18:42:19] csteipp: do you think we can call teh "data access guideliness" done? [18:42:21] *the [18:44:20] nuria_: I think you've given plenty of input. Legal is still working on making it official. [18:45:11] csteipp: maybe we can move it out of your user page so it can be found on office wiki? [18:45:45] 06Operations, 10ops-eqiad: ms-be1001.eqiad.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T132142#2200075 (10Cmjohnson) a:03fgiunchedi @fgiunchedi I replaced the disk and added the disk back but it's out of order. The server needs a reboot when you get a chance. Did this to add back 4... [18:48:05] 06Operations, 06Labs, 10hardware-requests: Get an updated quote from HP for virt nodes - https://phabricator.wikimedia.org/T132363#2200095 (10Andrew) 05Open>03declined Declined in favor of proper procurement ticket T132485 [18:49:32] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 13Patch-For-Review: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714#1859769 (10Gehel) Reinstall in progress to account for new disks in wdqs1002. The wdqs1001 server will still need to be reins... 
[18:51:35] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2200112 (10Jdlrobson) Thanks to @tgr I've confirmed that the issue is that ParserOutput::getTOCEnabled is false for these pages:... [18:56:27] (03PS2) 10Dzahn: customize check_disk monitor options for datasets [puppet] - 10https://gerrit.wikimedia.org/r/193834 (owner: 10ArielGlenn) [18:58:10] (03PS3) 10Dzahn: customize check_disk monitor options for datasets [puppet] - 10https://gerrit.wikimedia.org/r/193834 (owner: 10ArielGlenn) [18:58:14] (03PS1) 10Ladsgroup: scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) [18:59:51] (03CR) 10Eevans: Simplification of Cassandra Logstash filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [19:00:06] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160412T1900). Please do the needful. [19:00:14] (03CR) 10Chad: [C: 032] Group0 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282982 (owner: 10Chad) [19:00:32] (03PS4) 10Dzahn: customize check_disk monitor options for datasets [puppet] - 10https://gerrit.wikimedia.org/r/193834 (owner: 10ArielGlenn) [19:00:40] (03Merged) 10jenkins-bot: Group0 to wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282982 (owner: 10Chad) [19:01:07] (03CR) 10Dzahn: [C: 032] "@ArielGlenn amended and being bold. 
this one-liner should be all that is needed here meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/193834 (owner: 10ArielGlenn) [19:01:17] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.21 [19:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:58] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:02:03] (03PS2) 10Ladsgroup: scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) [19:03:57] (03PS1) 10EBernhardson: Update prune.rb to latest 1.4.5 tag [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/282993 [19:04:33] (03PS2) 10EBernhardson: Update prune.rb to 1.4.5 tag [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/282993 [19:04:46] (03CR) 10Thcipriani: [C: 031] scap: A basic workaround for the git clone issue [puppet] - 10https://gerrit.wikimedia.org/r/282992 (https://phabricator.wikimedia.org/T132267) (owner: 10Ladsgroup) [19:04:48] (03PS3) 10EBernhardson: Update prune.rb to 1.4.5 tag [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/282993 [19:06:26] (03CR) 10EBernhardson: "not sure if this is right ... i notice that prod is installing logstash 1.5.3 and mwv is installing 1.4.5" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/282993 (owner: 10EBernhardson) [19:10:27] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2200262 (10Jdlrobson) This is on the SWAt calendar for Wednesday 9:00PST https://wikitech.wikimedia.org/wiki/Deployments#Wednesd... [19:18:39] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[19:20:49] (03PS1) 10Dzahn: introduce kraz.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/282997 (https://phabricator.wikimedia.org/T123729) [19:22:10] (03PS2) 10Dzahn: introduce kraz.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/282997 (https://phabricator.wikimedia.org/T123729) [19:22:25] (03CR) 10Dzahn: [C: 032] "Beta Corvi is the second brightest star in the southern constellation of Corvus. It has the traditional name Kraz. The origin and meaning " [dns] - 10https://gerrit.wikimedia.org/r/282997 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [19:27:33] !log create ganeti VM kraz.codfw.wmnet [19:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:46] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/filerepo/file/LocalFile.php: I6457cb91: Don't report image cache hits / misses (duration: 00m 39s) [19:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:57] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[19:32:18] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/filerepo/file/LocalFile.php: I6457cb91: Don't report image cache hits / misses (duration: 00m 31s) [19:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:47] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/db/loadbalancer/LBFactory.php: I7bc3b3aa: Revert "Measure commitMasterChanges() run time" (duration: 00m 27s) [19:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:14] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/db/loadbalancer/LBFactory.php: I7bc3b3aa: Revert "Measure commitMasterChanges() run time" (duration: 00m 27s) [19:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:13] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2200594 (10JanZerebecki) Does `info proc mappings` help at least identifying the coarse location? [20:02:26] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2200668 (10ArielGlenn) Not really: Start Addr End Addr Size Offset objfile 0x400000 0x282c000 0x242c000... [20:05:38] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 1 failures [20:12:23] (03CR) 10JanZerebecki: [C: 031] added SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [20:16:35] (03CR) 10Bmansurov: "Thanks, Max. 
That's what we had initially (https://gerrit.wikimedia.org/r/#/c/278408/), but I wasn't sure about how to test varnish itself" [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [20:17:20] 07Puppet, 10Beta-Cluster-Infrastructure, 06Revision-Scoring-As-A-Service, 13Patch-For-Review, 03Scap3: Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2200743 (10Ladsgroup) I cherry-picked this patch into the beta puppetmaster. So the issu... [20:19:33] (03CR) 10Brion VIBBER: [C: 031] "Looks good (note the support in core renders all WebP source images to PNG thumbnails), but have not tested. Might be wise to double-check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281973 (https://phabricator.wikimedia.org/T27397) (owner: 10Matanya) [20:19:50] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2200752 (10JanZerebecki) Perhaps installing the debug symbols will just work (package hhvm-dbg) to resolve the location. 
[20:25:39] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: puppet fail [20:26:59] (03PS4) 10Eevans: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 [20:27:58] (03CR) 10Eevans: [C: 031] write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans) [20:31:58] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:38:40] !log mwscript deleteEqualMessages.php --wiki glwikisource [20:38:45] !log mwscript deleteEqualMessages.php --wiki glwikibooks [20:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:46] !log mwscript deleteEqualMessages.php --wiki glwiki [20:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:55] !log mwscript deleteEqualMessages.php --wiki glwikiquote [20:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:58] (03Abandoned) 10EBernhardson: Update prune.rb to 1.4.5 tag [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/282993 (owner: 10EBernhardson) [20:53:06] (03CR) 10Dzahn: "could the ferm::rule go into a role class as well instead of site.pp?" 
[puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [20:53:48] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:57:33] (03CR) 10Muehlenhoff: [C: 04-1] Gerrit: replicate git repositories to new home (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [20:58:18] (03PS1) 10Dzahn: site/install_server: add kraz.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/283064 (https://phabricator.wikimedia.org/T123729) [20:58:37] RECOVERY - WDQS HTTP on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 844 bytes in 0.002 second response time [20:58:45] (03PS2) 10Dzahn: site/install_server: add kraz.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/283064 (https://phabricator.wikimedia.org/T123729) [20:59:08] RECOVERY - WDQS SPARQL on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 844 bytes in 0.006 second response time [20:59:27] RECOVERY - DPKG on wdqs1002 is OK: All packages OK [20:59:28] RECOVERY - salt-minion processes on wdqs1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:59:54] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2200984 (10Pchelolo) [20:59:57] RECOVERY - dhclient process on wdqs1002 is OK: PROCS OK: 0 processes with command name dhclient [20:59:58] RECOVERY - RAID on wdqs1002 is OK: OK: optimal, 1 logical, 2 physical [21:00:37] RECOVERY - configured eth on wdqs1002 is OK: OK - interfaces up [21:00:58] RECOVERY - Disk space on wdqs1002 is OK: DISK OK [21:01:08] RECOVERY - Check size of conntrack table on wdqs1002 is OK: OK: nf_conntrack is 0 % full [21:02:53] (03CR) 10Dzahn: [C: 032] site/install_server: add kraz.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/283064 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [21:04:08] RECOVERY - Blazegraph Port on 
wdqs1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [21:04:20] (03CR) 10Chad: Gerrit: replicate git repositories to new home (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282761 (owner: 10Chad) [21:04:27] RECOVERY - Blazegraph process on wdqs1002 is OK: PROCS OK: 1 process with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [21:04:47] RECOVERY - WDQS HTTP Port on wdqs1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80 [21:04:58] RECOVERY - puppet last run on wdqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:05:10] (03CR) 10Krinkle: "We might be able to figure out group0/1/2 based on dbname (or we could emit it client-side), which would save a bit of the fragmentation a" [puppet] - 10https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) (owner: 10Ori.livneh) [21:05:48] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2200993 (10ArielGlenn) dumps.wm.o already redirects and so does downloads.wm.o. What's left? [21:08:26] (03CR) 10ArielGlenn: "amended when get back to salt-related issues next." 
[puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [21:08:42] (03CR) 10ArielGlenn: [C: 04-1] "woops, removed -1 by mistake" [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [21:09:28] (03CR) 10ArielGlenn: "nope, haven't been able to revisit it, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [21:10:30] (03PS4) 10Aaron Schulz: [WIP] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) [21:15:55] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2201055 (10BBlack) I'm confused - I think the last message above indicates we *do* have phab sending emails to mailing lists, which means we should use `?all`, b... [21:17:09] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2201057 (10greg) Right. Based on (at least?) those 3 accounts -> mailing lists, I guess we should use `?all`. [21:18:13] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2201063 (10BBlack) This isn't for dumps or downloads, this is for `http://datasets.wikimedia.org` [21:31:03] (03PS3) 10Chad: Gerrit: replicate git repositories to new home [puppet] - 10https://gerrit.wikimedia.org/r/282761 [21:36:24] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2201127 (10ArielGlenn) How does that not redirect already? I thought/hoped that the 'default' parameter in the nginx conf on datasets would take care of t... 
[21:37:05] (03PS1) 10Dzahn: datasets: rewrite http to https [puppet] - 10https://gerrit.wikimedia.org/r/283084 (https://phabricator.wikimedia.org/T132463) [21:38:34] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2201136 (10Dzahn) >>! In T132463#2201127, @ArielGlenn wrote: > How does that not redirect already? I thought/hoped that the 'default... [21:39:46] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2201140 (10Dzahn) Somewhat counterintuitively the config for this is not in module dataset but instead in modules/statistics/files/... [21:41:14] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2201143 (10BBlack) I hate it, but I don't see any better way to solve the problem at present. To reword to be sure I have the... [21:43:43] (03PS1) 10Dzahn: stats/datasets: remove Apache virtual host stat1001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283086 [21:45:50] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2201182 (10ArielGlenn) That just can't be true. And yet it is. 2d1fa5938f22c94324435c75acc4f496c1cacaa3 @Ottomata set it up. Some... 
[21:45:58] mutante: we don't have to redirect it in apache [21:46:11] for all of these tickets, we can delete one line in cache_misc's VCL and the redirect will happen in varnish [21:46:15] just need approval that it's ok [21:46:58] although I guess sometimes for efficiency's sake, the app needs updating too so that its internal links are https, to avoid having to constantly redirect in varnish all the time [21:47:42] 06Operations, 10Datasets-General-or-Unknown, 10Traffic, 07HTTPS, 13Patch-For-Review: HTTPS redirects for datasets.wikimedia.org - https://phabricator.wikimedia.org/T132463#2201190 (10Dzahn) Hah, and that says "# NOTE: This class has nothing to do with the # datasets site hosted at 'datasets.wikimedia.org... [21:49:50] btw that was my last comment on tickets for tonight [21:49:51] bblack: ooh, ok, so we have done it in Apache for lot of misc services in the past. [21:49:54] it's 1 am and I'm done [21:50:06] ok [21:50:35] mutante: basically for the ones that are left, if we're ok to redirect, we can delete their line in templates/varnish/misc-frontend.inc.vcl.erb until there's no lines left there [21:51:09] they may need/want other fixups at the service level too because they're emitting their own http:// absolute links or something, but that's just an optimization at that point; varnish won't let insecure traffic through to them. [21:51:48] bblack: *nod* , ok cool! [21:54:12] (also, note that all remaining cache_misc services are already set to HSTS-capture users who visit them over https on their own. If you go via HTTPS once with an HSTS-supporting browser, you get an HSTS header and you're stuck there. It can confuse testing!) 
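[editor's note] The HSTS "capture" described above can be made concrete with a small sketch. It parses a Strict-Transport-Security header value and reports whether a browser that has seen it would keep forcing HTTPS. The function names and the sample header value are illustrative assumptions, not taken from the cache_misc VCL or from any actual response.

```python
def parse_hsts(header_value):
    """Parse a Strict-Transport-Security header value into a small dict.

    Hypothetical helper for illustration only; it handles the three
    directives defined for the header and ignores anything else.
    """
    policy = {"max_age": None, "include_subdomains": False, "preload": False}
    for directive in header_value.split(";"):
        directive = directive.strip()
        if not directive:
            continue
        name, _, value = directive.partition("=")
        name = name.strip().lower()
        value = value.strip().strip('"')
        if name == "max-age" and value.isdigit():
            policy["max_age"] = int(value)
        elif name == "includesubdomains":
            policy["include_subdomains"] = True
        elif name == "preload":
            policy["preload"] = True
    return policy


def browser_forces_https(policy):
    """True if a browser that stored this policy would upgrade later http:// visits."""
    return policy["max_age"] is not None and policy["max_age"] > 0


# Assumed example value, not the header actually served by any Wikimedia host.
example = parse_hsts("max-age=31536000; includeSubDomains; preload")
```

This is why testing can be confusing: once a browser has stored a policy with a positive max-age, it rewrites http:// requests to https:// on its own, so the server-side 301 is never exercised from that browser. A header with max-age=0 clears the stored policy.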
[21:55:08] oh, interesting, didn't know there is such a capture thing [21:56:01] 06Operations, 10Traffic, 07Graphite, 07HTTPS: HTTPS redirects for graphite.wikimedia.org - https://phabricator.wikimedia.org/T132461#2201268 (10BBlack) [21:56:30] yeah all we're missing is the 301 redirect [21:57:38] alright, guess i can abandon that Apache change [21:58:12] (03PS1) 10BBlack: use https://parsoid-tests in testreduce T132462 [puppet] - 10https://gerrit.wikimedia.org/r/283088 [22:00:13] (03PS6) 10Smalyshev: Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) [22:00:14] basically all of these tickets are attempts to elicit screams of "wait don't do that, it will break my something-or-other" [22:00:27] but if I don't see them this week, probably next step is go ahead and redirect them and then figure out who screams [22:01:08] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: puppet fail [22:01:10] i figured that, yea. it sounds like a good plan [22:02:51] (03PS1) 10BBlack: Use https://config-master.wm.o for rolematcher T132459 [puppet] - 10https://gerrit.wikimedia.org/r/283089 [22:04:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0] [22:04:52] (03CR) 10Krinkle: Apply rate limit to edits for normal users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (owner: 10Jforrester) [22:06:06] (03Abandoned) 10Dzahn: datasets: rewrite http to https [puppet] - 10https://gerrit.wikimedia.org/r/283084 (https://phabricator.wikimedia.org/T132463) (owner: 10Dzahn) [22:06:37] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: puppet fail [22:16:43] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2201380 (10GWicke) @BBlack: Yes, that captures the idea very well. 
I'm also not too fond of the VCL part of this, but couldn'... [22:17:01] !log mathoid deploying ca7680521 [22:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:21:23] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10BBlack) [22:21:57] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201406 (10BBlack) [22:21:59] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2201405 (10BBlack) [22:23:04] 06Operations, 10Traffic, 07HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#2201410 (10BBlack) [22:23:06] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#1146258 (10BBlack) [22:27:12] (03Abandoned) 10Dzahn: remove db200[1-7] from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/278338 (https://phabricator.wikimedia.org/T125827) (owner: 10Dzahn) [22:27:23] ori: Hm.. is there an easy way to pull regexes in mwgrep? [22:27:53] I see regexp used elsewhere in the script, but I think term is limited to Java String#contains() per the mwgrep.groovy file [22:28:10] We now have insource:/pattern/ support in CirrusSearch on-wiki [22:28:19] so the mwgrep one is a bit behind in terms of functionality [22:29:07] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:31:25] wikipedia seems to be unreachable from my ISP [22:31:33] Vito, can you provide a traceroute? 
[22:31:38] yep Krenair [22:32:03] I'm also testing from a server on the same ISP [22:32:31] ofc I cannot log to phab [22:32:54] Vito, no worries, you can dump the traceroute straight into PM with me if needed [22:33:01] I can post it on phab [22:33:07] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [22:33:21] indeed [22:33:26] 06Operations, 10ops-eqiad, 10DBA: db1070 and db1071 overheating problems - https://phabricator.wikimedia.org/T132515#2201468 (10Volans) [22:33:45] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2201470 (10Volans) [22:34:04] here's the problem [22:34:38] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:34:44] mw1163? nobody? [22:34:59] looks at SAL [22:36:15] !log powercycled mw1163 [22:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:51] even ping fails [22:38:17] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [22:44:59] Vito: so a couple of us are looking at the traceroute. can you provide the earlier part of the full traceroute and include your ip info so we can attempt to reverse back? [22:45:46] We don't have any planned maintenance that would cause this (that I can see) [22:46:50] what traceroute? [22:47:03] https://phabricator.wikimedia.org/P2888 krenair made for him [22:47:07] but it only has part of it [22:47:10] see other channel [22:47:34] well, other channel just has basically that info. vito states he cannot reach our servers via his ISP. [22:47:46] and the paste has the second part of the traceroute [22:48:02] So I've requested that the full traceroute and source IP be provided.
[22:48:09] sorry robh, was on another info [22:48:38] I removed the 172.16.0.0/12 stuffs [22:49:07] which is my ISP's tunnel towards their internal exchange point in Milan [22:49:35] and I disabled ping on both connection I've been using for testing .__. [22:49:49] afaics there's general problems with sprintlink, but that's just my approximate guess so far [22:49:57] you may have to contact your ISP [22:50:28] matt_flaschen: RoanKattouw: Can you roll out https://gerrit.wikimedia.org/r/#/c/272929/ today? [22:50:32] (and update if needed) [22:50:42] it seems to be a problem with sprint bblack, definitely [22:50:48] or better [22:51:03] I tried Amsterdam sprint's looking glass [22:51:16] and it seems to be able to reach my ISP [22:51:41] Krinkle: Sure [22:52:40] is gerrint in Ashburn? [22:52:44] *gerrit [22:52:44] sprint provides these interactive maps [22:52:55] https://www.sprint.net/performance [22:53:17] but i dont see an obvious thing with to/from Milan [22:53:34] yes Vito [22:53:36] Krinkle, yes, thanks for the reminder. [22:53:41] and I can load it Krenair [22:53:45] Vito: yes [22:53:59] can you ping text-lb.eqiad.wikimedia.org ? 
[22:54:47] mutante: 53ms from amsterdam to milan is awful [22:55:01] yep Krenair [22:55:42] Vito: their SLA says they are committing to 45ms backbone delay [22:55:47] so yea [22:56:06] * Vito z0mgs [22:56:18] it seems their Milan node has some problem though [22:56:37] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:56:57] uhm not really, just crappy [22:57:36] (03PS2) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [22:58:41] (03PS5) 10Mattflaschen: Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929 [22:59:09] ebernhardson: are you up for adding basic regexp support to mwgrep, a la insource:/pattern/ ? Krinkle was just asking for it. It would take me a while to figure out how to do it, I think. [22:59:44] ori: i could look into it i suppose, not really sure what has to be done either :) [23:00:00] where does the groovy file live? [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160412T2300). [23:00:25] ah, elastic itself [23:00:31] There's nothing in today's SWAT, but per Krinkle's request I'll deploy https://gerrit.wikimedia.org/r/#/c/272929/ [23:01:17] RoanKattouw, there is no. [23:01:20] That. :) [23:01:26] s/no/now. [23:01:27] ori: yea it's part of puppet and gets deployed to the servers [23:01:56] matt_flaschen: Oh and I see you just updated it too? [23:02:05] matt_flaschen: Actually, would you mind deploying it yourself? [23:02:11] RoanKattouw, not at all. [23:02:16] Thanks [23:02:19] RoanKattouw, yeah. I updated it for real a little while ago, but forgot to follow up. Then I just did a trivial rebase just now.
[23:02:27] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 95 % full [23:02:29] Then I don't have to push off Joe, he asked to talk to me just a minute ago [23:03:43] RoanKattouw: Might have an urgent SWAT item, FWIW. [23:03:53] James_F: #REDIRECT matt_flaschen [23:04:04] matt_flaschen: https://gerrit.wikimedia.org/r/283100 ? [23:04:10] James_F, I'm doing SWAT. I.E. I'm doing my own config patch which in theory should have no effect other than slight perf boost. [23:04:18] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 34 % full [23:04:40] James_F, of course. [23:04:57] matt_flaschen: It's not merged in master yet. [23:05:27] James_F, I know. I'm "testing" it now. I.E. grepping for additional texfield. [23:05:34] James_F, do you want me to merge, or let someone else review? [23:05:50] matt_flaschen: legoktm just merged; am cherry-picking. [23:05:58] (PS1) Dzahn: rsync bast1001 home dirs to osmium, temp for upgrade [puppet] - https://gerrit.wikimedia.org/r/283102 (https://phabricator.wikimedia.org/T123721) [23:06:10] matt_flaschen: https://gerrit.wikimedia.org/r/#/c/283103/ [23:06:21] (PS2) Yuvipanda: tools: Don't track mediawiki's font list for precise [puppet] - https://gerrit.wikimedia.org/r/282958 (https://phabricator.wikimedia.org/T132282) [23:07:39] Krinkle: what exactly are you trying to do with regex in mwgrep? [23:07:58] James_F, I do wonder why we're using input type=search, but styling it as text. Why not just use input type=text? Probably there is a good reason, though. [23:08:07] ebernhardson: To find more precise patterns. E.g. to find '\bchangeText\b' or some such. [23:08:23] matt_flaschen: I dunno. I just work here. :-) [23:08:27] ebernhardson: or to find stuff with at least one more character after it. Or to search for multiple [23:08:33] I use it quite a lot on-wiki. [23:08:44] currently I get a lot of false positives.
[23:08:50] (CR) Dzahn: [C: 2] rsync bast1001 home dirs to osmium, temp for upgrade [puppet] - https://gerrit.wikimedia.org/r/283102 (https://phabricator.wikimedia.org/T123721) (owner: Dzahn) [23:09:01] (CR) Mattflaschen: [C: 2] Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - https://gerrit.wikimedia.org/r/272929 (owner: Mattflaschen) [23:09:28] (Merged) jenkins-bot: Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - https://gerrit.wikimedia.org/r/272929 (owner: Mattflaschen) [23:10:53] everything is broken.. sigh "Could not evaluate: Could not find init script or upstart conf file for 'rsync'" [23:11:24] trusty.. pff [23:12:09] !log mattflaschen@tin Synchronized dblists: Make flow dblist explicit, rather than computed (duration: 00m 31s) [23:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:20] Krinkle: you can try tin:/home/ebernhardson/mwgrep, i'm not sure what patterns to try with it though. I get 0 for '\bchangeText\b' but i'm guessing that was more an example and not a real query :) [23:12:30] Confirmed [23:12:34] ebernhardson: It was a real one, but I think I fixed them all. [23:12:59] Actually, only one Apache failed. [23:13:21] mutante, is https://phabricator.wikimedia.org/P2889 related? [23:13:29] One of the web servers failed with my sync just now.
[23:14:04] Krinkle: if that script works for you i can push up a puppet patch, but i'm not 100% sure it's right (but seems plausible) [23:14:15] other question is if it should be some flag, or just the new default to use regexp [23:14:26] matt_flaschen: no, it's just "yet another problem" :/ [23:14:45] ebernhardson: /home/ebernhardson/mwgrep '[^$]changeText\b' is matching https://ar.wikipedia.org/wiki/MediaWiki:Gadget-QEditor.js "changeTextbox" [23:14:46] in my case puppet won't even start the rsync service [23:14:56] because there is no init script and no upstart..nothing [23:15:03] (the first part is to exclude Foo.$changeText() which some other gadget has) [23:15:09] and i know it used to work (on other distros at least) [23:16:09] !log mattflaschen@tin Synchronized dblists: Make flow dblist explicit, rather than computed, retry (duration: 00m 29s) [23:16:10] can't ever just do something without having to create another ticket [23:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:15] mutante, :( Regarding my issue, do you know how I take a web server out of rotation? The same one failed both times, mw1080.eqiad.wmnet . [23:17:24] of course "unhandled error" is also great [23:17:44] Looks like underlying error is read-only file system on mw1080. [23:18:06] Operations, DNS, Mail, Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2201603 (scfc) Sending mails to mailing lists shouldn't matter as those rewrite the envelope header. The problem lies with forwarders that don't do that, for... [23:18:12] !log Syncing failing only on mw1080 [23:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:33] telecom Italia disappeared from it.wiki's recentchanges [23:19:46] ebernhardson: OK. That doesn't work on-wiki either. Looks like this does it!
/home/ebernhardson/mwgrep '[^$]changeText[^a-zA-Z]' [23:19:46] this confirms there's something broken [23:19:55] I guess \b and \s etc. aren't supported? [23:20:53] greg-g, andrewbogott, do you know how I take a web server out of rotation if syncs to it are failing? [23:21:38] Vito: what's the most recent anon edit with a telecom italia IP? [23:24:36] Krinkle: hmm, possible [23:24:47] ebernhardson: Is there documentation on what patterns it supports? [23:24:57] I was looking at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html earlier [23:24:59] not sure if that' related [23:25:43] Krinkle: looking, i think it might not be default elasticsearch stuff [23:26:10] !log depooled mw1080 [23:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:18] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 95 % full [23:26:28] ebernhardson: [23:26:31] [palladium:~] $ confctl --tags dc=eqiad,cluster=appserver,service=apache2 --action set/pooled=no mw1080.eqiad.wmnet [23:26:34] mw1080.eqiad.wmnet: pooled changed inactive => no [23:26:59] mutante: me? 
[23:27:04] no, sorry [23:27:05] matt_flaschen: [23:27:07] :) [23:27:48] they are also listed in puppet repo ./conftool-data [23:27:52] but https://wikitech.wikimedia.org/wiki/Conftool#Modify_the_state_of_a_server_in_a_pool [23:28:02] (i have not done that much since we use conftool though) [23:28:09] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 38 % full [23:28:12] used to be just the dsh groups [23:28:12] Thanks, mutante [23:28:22] yw [23:29:06] !log mw1080 depooled because read-only fs [23:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:17] PROBLEM - HHVM rendering on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.002 second response time [23:29:38] PROBLEM - Apache HTTP on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [23:30:18] PROBLEM - HHVM processes on mw1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [23:30:32] well, yea [23:30:36] i guess [23:30:49] is that normal now after depooling? 
[23:31:12] or just because it's broken [23:31:34] creates ticket to fix that hardware [23:32:00] Krinkle: this should be the supported language: https://lucene.apache.org/core/4_10_4/core/org/apache/lucene/util/automaton/RegExp.html [23:32:22] !log mwscript deleteEqualMessages.php --wiki elwiki (T45917) [23:32:23] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [23:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:26] Operations, ops-eqiad: mw1080 - readonly fs - depooled - https://phabricator.wikimedia.org/T132529#2201653 (Dzahn) [23:33:35] Krinkle: doesn't look like any short-hand character classes are supported [23:34:09] ACKNOWLEDGEMENT - Apache HTTP on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time daniel_zahn https://phabricator.wikimedia.org/T132529 [23:34:09] ACKNOWLEDGEMENT - HHVM processes on mw1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm daniel_zahn https://phabricator.wikimedia.org/T132529 [23:34:09] ACKNOWLEDGEMENT - HHVM rendering on mw1080 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time daniel_zahn https://phabricator.wikimedia.org/T132529 [23:34:22] abartov: OK. that's fine :) [23:34:23] Thanks [23:35:43] Operations, ops-eqiad: mw1080 - readonly fs - depooled - https://phabricator.wikimedia.org/T132529#2201656 (Dzahn) 16:22 < matt_flaschen> !log Syncing failing only on mw1080 16:30 < mutante> !log depooled mw1080 16:31 < mutante> [palladium:~] $ confctl --tags dc=eqiad,cluster=appserver,service=apache... [23:36:30] Apr 12 19:54:59 mw1080 kernel: [2362308.579821] sd 0:0:0:0: [sda] [23:36:33] Apr 12 19:54:59 mw1080 kernel: [2362308.579824] Add.
Sense: Unrecovered read error - auto reallocate failed [23:36:39] another one bites the dust [23:37:08] Operations, ops-eqiad: mw1080 - readonly fs - depooled - https://phabricator.wikimedia.org/T132529#2201660 (Dzahn) ``` root@mw1080:~# tail /var/log/syslog Apr 12 19:54:59 mw1080 kernel: [2362308.579806] Descriptor sense data with sense descriptors (in hex): Apr 12 19:54:59 mw1080 kernel: [2362308.579808]... [23:37:33] Operations, ops-eqiad: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2201662 (Dzahn) [23:40:01] (PS1) EBernhardson: Convert mwgrep to use regexp by default [puppet] - https://gerrit.wikimedia.org/r/283107 [23:40:39] (PS2) EBernhardson: Convert mwgrep to use regexp by default [puppet] - https://gerrit.wikimedia.org/r/283107 [23:44:36] Operations, VisualEditor experimentation: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2201664 (Dzahn) [23:45:22] matt_flaschen: I think I do… do you still need it depooled or did I miss the part where someone else fixed it? [23:46:13] andrewbogott, mutante did it.
Looks like he also figured out the cause: https://phabricator.wikimedia.org/T132529 [23:46:28] thanks mutante [23:48:48] Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2201690 (Dzahn) [23:49:49] Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2201716 (Dzahn) [23:50:31] !log mattflaschen@tin Synchronized php-1.27.0-wmf.21/resources/lib/oojs-ui/: OOjs UI hotfix for search box styling (duration: 00m 27s) [23:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:38] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 95 % full [23:51:07] Operations, Patch-For-Review: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2201719 (Dzahn) as usual something else is broken so we can't "just do this" -> T132532 [23:51:10] It would be nice if depooling the web server would also de-pool it from sync-dir/sync-file/scap. [23:51:14] James_F, please test. [23:51:54] Checking, [23:52:29] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 29 % full [23:53:17] matt_flaschen: Yup, works. [23:53:20] matt_flaschen: Thank you!
[23:53:31] No problem [23:53:37] SWAT complete [23:57:15] (PS1) Dzahn: use tungsten, not osmium to copy bast home dirs [puppet] - https://gerrit.wikimedia.org/r/283111 (https://phabricator.wikimedia.org/T123721) [23:58:31] (PS2) Dzahn: use tungsten, not osmium to copy bast home dirs [puppet] - https://gerrit.wikimedia.org/r/283111 (https://phabricator.wikimedia.org/T123721) [23:59:08] (CR) Dzahn: [C: 2] use tungsten, not osmium to copy bast home dirs [puppet] - https://gerrit.wikimedia.org/r/283111 (https://phabricator.wikimedia.org/T123721) (owner: Dzahn) [23:59:30] !log mwscript deleteEqualMessages.php --wiki fawiktionary [23:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:56] Operations, Mathoid, Services: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2201751 (Physikerwelt) Open>Resolved a:Physikerwelt