[00:00:00] (03PS2) 10Addshore: WIP Add grafana_json_datasource [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) [00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161118T0000). [00:06:04] godog: I managed to push a draft up continuing with my grafana simple json datasource stuff that started a few weeks ago. You can find it at https://gerrit.wikimedia.org/r/#/c/322220/ General puppet review would be great! :) [00:06:56] addshore: ok! thanks for the heads up, I might get to it tomorrow [00:07:28] godog: awesome! No real rush :) [00:07:47] (03PS4) 10Reedy: Shift 10 more extensions to use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319907 (https://phabricator.wikimedia.org/T140852) [00:07:52] (03CR) 10Reedy: [C: 032] Shift 10 more extensions to use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319907 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [00:07:56] Might aswell get that one out [00:08:58] (03Merged) 10jenkins-bot: Shift 10 more extensions to use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319907 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [00:09:36] (03PS5) 10Filippo Giunchedi: [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [00:10:06] !log reedy@tin Synchronized wmf-config/extension-list: More extensions to extension.json (duration: 00m 48s) [00:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:12] !log reedy@tin Synchronized wmf-config/CommonSettings.php: wfLoadExtension for numerous extensions (duration: 00m 48s) [00:11:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:45] (03PS6) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [00:17:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:21:58] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:13] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/4610/" [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [00:45:53] (03CR) 10Filippo Giunchedi: "LGTM, @jcrespo what do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [00:46:47] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [00:49:00] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [00:59:42] (03CR) 10Gergő Tisza: "This would mean different password policies / 2FA flag on different SUL wikis, wouldn't it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [01:00:15] (03CR) 10Gergő Tisza: "Also, doesn't seem to take global groups into account." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [01:00:52] (03CR) 10Reedy: "Global Groups have oathauth enabled via https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [01:35:36] (03PS1) 10Yuvipanda: labstore: Add account_services to secondary too [puppet] - 10https://gerrit.wikimedia.org/r/322224 [01:45:53] win 18 [01:57:25] (03PS2) 10Yuvipanda: labstore: Add account_services to secondary too [puppet] - 10https://gerrit.wikimedia.org/r/322224 (https://phabricator.wikimedia.org/T151014) [01:58:26] 06Operations, 10Phabricator, 13Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2804750 (10Dzahn) well, going by the original task description example: was: @phab2001:~# ssh 10.64.32.150 ssh: connect to host 10.64.32.150 port 22: No route to h... [01:59:44] 06Operations, 10Phabricator, 13Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2804753 (10Dzahn) 05Open>03Resolved ``` [phab2001:~] $ ssh 10.64.32.150 Password: [iridium:~] $ ssh phab2001.codfw.wmnet The authenticity of host 'phab2001.co... [02:00:07] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[02:00:11] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2804767 (10Dzahn) [02:07:31] 06Operations, 10Continuous-Integration-Infrastructure: contint: move .htacess file for doc.wm into regular Apache config - https://phabricator.wikimedia.org/T149928#2804784 (10Dzahn) [02:09:55] 06Operations, 10Continuous-Integration-Infrastructure: contint: move .htacess file for doc.wm into regular Apache config - https://phabricator.wikimedia.org/T149928#2804785 (10Dzahn) We had some side effects during the move , very few users might have seen doc.wikimedia.org return an error or an unusual direct... [02:12:42] 06Operations, 10Continuous-Integration-Infrastructure: contint: move .htacess file for doc.wm into regular Apache config - https://phabricator.wikimedia.org/T149928#2804786 (10Dzahn) the advantage of this is that we don't have Apache config anymore that gets created from 2 places in 2 different repos, which co... [02:27:13] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 09m 18s) [02:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:41] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller [02:31:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Nov 18 02:31:52 UTC 2016 (duration 4m 40s) [02:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:18] (03PS6) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [02:36:59] (03CR) 10Dzahn: "Volans, ugh, yea, what happened was the upload failed with "modules/base/manifests/standard_packages.pp: needs merge" and needed manual re" [puppet] - 10https://gerrit.wikimedia.org/r/320246 
(https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [02:44:27] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/4611/" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [02:45:32] "Error: Failed to parse template prometheus/cluster_config.erb:" when puppet-compiler runs something unrelated on bast4001 [02:45:37] but just there [02:48:20] (03PS6) 10Dzahn: contint: add postgresql to contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [02:48:36] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2804824 (10Tgr) Re: original size in URL, beside feeling a bit hacky, there are some (minor) disadvantages: generating the URL to the original file bas... [02:48:36] (03CR) 10jenkins-bot: [V: 04-1] contint: add postgresql to contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [02:50:19] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2804826 (10Tgr) Would the API cover `img_auth.php`? How about stashed uploads and thumbnails thereof (a horrible pile of hacks currently)? Or more gene... 
[02:50:53] (03PS7) 10Dzahn: contint: add postgresql to contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [02:51:08] (03CR) 10jenkins-bot: [V: 04-1] contint: add postgresql to contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [02:54:25] (03PS8) 10Dzahn: contint: add postgresql to contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [02:55:46] (03CR) 10Dzahn: [C: 04-2] "right, this could not be rebased, contint::packages is not there anymore as it was" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [02:59:54] (03CR) 10Dzahn: "can you please postgresql packages into a separate file under contint/manifests/packages/ (or an existing one if it makes sense). packages" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [03:15:09] (03Abandoned) 10Krinkle: StartProfile: Add try/catch around Xhgui->save() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321907 (owner: 10Krinkle) [03:22:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 715.02 seconds [03:27:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.17 seconds [03:37:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 238.94 seconds [03:45:01] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[03:54:51] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller [04:03:11] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [04:25:01] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [04:34:31] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller [04:42:43] (03PS2) 10Dzahn: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [04:43:53] (03CR) 10jenkins-bot: [V: 04-1] Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [04:44:44] (03PS3) 10Dzahn: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [04:45:55] (03CR) 10jenkins-bot: [V: 04-1] Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [04:46:00] (03PS4) 10Dzahn: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [04:54:41] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [05:07:00] (03PS5) 10Krinkle: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [06:10:02] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [06:14:51] PROBLEM - MD RAID on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:16:41] RECOVERY - MD RAID on thumbor1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [06:29:43] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller [06:45:53] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apt-transport-https] [06:54:33] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ngrep] [07:11:23] 06Operations, 10ops-codfw, 10DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2805005 (10Marostegui) Hey @Papaul That disk looks good now, but there is another one in predictive failure: ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box... 
[07:12:53] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:21:33] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:24:48] !log installing openjdk-7 security updates on trusty systems [07:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:23] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:32:16] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2805016 (10Nemo_bis) It would be useful to clarify whether this task is only about WMF or whether someone is proposing to... [07:33:13] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:34:13] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:40:13] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:40:13] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:41:03] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:41:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:41:38] PROBLEM - Kafka Broker Server on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [07:42:14] <_joe_> wat? 
[07:42:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:43:52] <_joe_> !log restarting kafka on kafka1022, too many open files [07:43:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:49:24] huh [07:50:13] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:52:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1018 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:55:26] <_joe_> !log rebooting kafka1022, a shower of defunct processes, kafka refuses to startup again [07:55:43] PROBLEM - jmxtrans on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [07:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:23] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:58:13] PROBLEM - Check whether ferm is active by checking the default input chain on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:13] PROBLEM - Check size of conntrack table 
on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:13] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [07:58:13] PROBLEM - configured eth on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:13] PROBLEM - puppet last run on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:14] PROBLEM - DPKG on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:33] PROBLEM - MD RAID on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:33] PROBLEM - Disk space on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:43] PROBLEM - dhclient process on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:43] PROBLEM - salt-minion processes on kafka1022 is CRITICAL: Return code of 255 is out of bounds [07:58:53] PROBLEM - SSH on kafka1022 is CRITICAL: connect to address 10.64.36.122 and port 22: Connection refused [07:59:33] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] [07:59:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] [07:59:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [10.0] [08:00:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [10.0] [08:00:04] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[08:00:13] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [10.0] [08:00:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [08:01:13] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:02:13] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:02:13] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:13] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:02:13] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1022_v4,kafka1022_v6 [08:02:13] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1022_v4,kafka1022_v6 [08:02:23] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1022_v4,kafka1022_v6 [08:02:23] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:23] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:23] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:23] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:23] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:24] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:24] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: 
kafka1022_v4,kafka1022_v6 [08:02:25] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:26] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:26] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:27] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:27] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:33] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:02:33] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:02:33] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:33] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:33] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:34] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:34] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:35] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:35] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:53] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:02:53] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:02:53] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:02:53] PROBLEM 
- IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:02:53] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1022_v4,kafka1022_v6 [08:02:53] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:03:03] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1022_v4,kafka1022_v6 [08:03:03] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:03:03] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:03:03] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:03:03] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:03:03] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1022_v4,kafka1022_v6 [08:03:04] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1022_v4,kafka1022_v6 [08:03:43] PROBLEM - Host kafka1022 is DOWN: PING CRITICAL - Packet loss = 100% [08:07:07] <_joe_> !log physical powercycle of kafka1022 (broken disk) [08:07:25] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2805039 (10hashar) >>@Gehel wrote: > @hashar : you probably have some experience in buildin... 
[08:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:49] good morning kafka [08:07:58] it is nice to see you again in such a great shape [08:08:01] * elukey checking [08:08:33] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 79.31% of data above the critical threshold [10.0] [08:09:33] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller [08:13:13] RECOVERY - Host kafka1022 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [08:17:59] (03PS1) 10Marostegui: db-codfw.php: Depool db2049 for the weekend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322238 (https://phabricator.wikimedia.org/T150876) [08:19:17] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2049 for the weekend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322238 (https://phabricator.wikimedia.org/T150876) (owner: 10Marostegui) [08:19:51] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2049 for the weekend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322238 (https://phabricator.wikimedia.org/T150876) (owner: 10Marostegui) [08:21:03] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [08:21:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2049 - T150876 (duration: 00m 49s) [08:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:40] T150876: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876 [08:22:38] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2805079 (10hashar) I thought at first you referred to the Jenkins job that attempt to build... 
[08:23:06] RECOVERY - mysqld processes on db2049 is OK: PROCS OK: 1 process with command name mysqld [08:24:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [08:24:33] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:25:53] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [08:27:03] RECOVERY - MariaDB Slave IO: s2 on db2049 is OK: OK slave_io_state Slave_IO_Running: Yes [08:27:04] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2805084 (10Marostegui) I have started MySQL and let it recover, as there was no errors. I have started replication. Even though the burning tests were fine, I have depooled the... [08:27:13] RECOVERY - MariaDB Slave SQL: s2 on db2049 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:28:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [08:30:23] RECOVERY - Check size of conntrack table on kafka1022 is OK: OK: nf_conntrack is 0 % full [08:30:23] RECOVERY - configured eth on kafka1022 is OK: OK - interfaces up [08:30:23] RECOVERY - DPKG on kafka1022 is OK: All packages OK [08:30:23] RECOVERY - Check whether ferm is active by checking the default input chain on kafka1022 is OK: OK ferm input default policy is set [08:30:23] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 56 minutes ago with 0 failures [08:30:33] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [08:30:33] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [08:30:33] RECOVERY - MD RAID on kafka1022 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [08:30:33] RECOVERY - IPsec on cp4003 is 
OK: Strongswan OK - 28 ESP OK [08:30:33] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [08:30:33] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [08:30:34] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [08:30:34] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [08:30:35] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [08:30:35] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [08:30:36] RECOVERY - Disk space on kafka1022 is OK: DISK OK [08:30:43] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [08:30:43] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [08:30:43] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [08:30:43] RECOVERY - jmxtrans on kafka1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [08:30:43] RECOVERY - dhclient process on kafka1022 is OK: PROCS OK: 0 processes with command name dhclient [08:30:43] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [08:30:44] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [08:30:44] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [08:30:45] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [08:30:45] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [08:30:46] RECOVERY - salt-minion processes on kafka1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:30:53] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [08:30:53] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [08:30:53] RECOVERY - SSH on kafka1022 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [08:30:53] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [08:30:53] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [08:30:53] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [08:30:54] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [08:31:03] 
RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [08:31:03] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [08:31:03] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [08:31:03] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [08:31:03] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [08:31:03] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [08:31:04] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [08:31:13] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [08:31:13] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [08:31:13] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [08:31:13] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [08:31:23] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [08:31:23] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [08:31:23] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [08:31:23] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [08:31:23] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [08:31:23] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [08:31:24] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [08:31:24] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [08:31:25] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [08:31:25] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [08:31:26] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [08:31:26] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [08:31:27] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [08:33:14] !log kafka1022 up and running with kafka* daemon masked and broken disk removed from fstab (we mount partitions in there using UUIDs) [08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:14] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search 
(Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2805089 (10hashar) Looking at debian/patches/0004-use-wmf-archiva.patch it disables maven... [08:38:07] (03PS20) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [08:42:48] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 06DC-Ops: Kafka1022 needs a new disk - https://phabricator.wikimedia.org/T151028#2805092 (10elukey) [08:43:26] jynus: are you around ? [08:43:50] (03PS1) 10Marostegui: db-codfw.php: Depool db2070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322240 (https://phabricator.wikimedia.org/T149553) [08:45:12] marostegui: fyi, i am about to rename an account with 75k+ edits. can you please be around and verify the sky doesn't fall ? [08:45:27] matanya: in which shard? [08:45:38] globally [08:45:43] ah roger [08:45:55] matanya: Do you have a ticket number so I can get some context? [08:47:08] marostegui: no ticket, i can create one. most edits on pt.wiki [08:47:50] matanya: So what is the plan then? [08:47:54] marostegui: see this morning you didn't do your usual alter table and other people do it for you [08:48:01] :P [08:48:14] elukey: The Clinic duty week is getting me distracted from my lovely morning alter tables! :p [08:48:45] marostegui: plan -> click rename user in the UI :) [08:48:58] XDD [08:49:07] matanya there is a process now [08:49:12] for those jobs [08:49:14] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 06DC-Ops: Kafka1022 needs a new disk - https://phabricator.wikimedia.org/T151028#2805117 (10elukey) p:05Triage>03High [08:49:20] it has to be added to Deployments [08:49:38] jynus: long running jobs ? 
[08:49:44] yes [08:49:51] on the weekly part [08:50:08] https://wikitech.wikimedia.org/wiki/Deployments#Week_of_November_14th [08:50:13] so schedule now and wait, or run? [08:50:26] see, there are schema changes and job runs [08:50:47] "schema change on..." [08:50:59] "Populate X..." [08:51:07] i will update mediawiki message to include this info [08:51:38] however, I would ask you to wait a bit right now [08:51:49] because of the ongoing enwiki change [08:51:58] ok. let me know when a good time is [08:52:01] (it should finish in a few minutes) [08:52:13] thanks jynus for all the information :) [08:52:33] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:52:43] are you going to be online later? [08:52:53] like in an hour or so? [08:53:13] matanya ^? [08:53:48] likely jynus [08:54:03] I can ping you when it finishes [08:54:57] but add it to deployments so nobody takes your "place" :-) [08:55:43] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:57:55] 06Operations, 10hardware-requests: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313#2805119 (10MoritzMuehlenhoff) a:03Cmjohnson [09:02:58] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322240 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [09:03:34] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322240 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [09:03:47] ok jynus thanks [09:05:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2070 - T149553 (duration: 00m 48s) [09:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:24] T149553: db2034: investigate its crash and reimage - 
https://phabricator.wikimedia.org/T149553 [09:06:23] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T150973#2805147 (10Volans) [09:09:04] (03CR) 10Mobrovac: PDF Render Service: Role and module (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [09:11:32] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T150973#2805150 (10Marostegui) That disk was in predictive failure and was changed by @Papaul yesterday, so that is why it is probably marked as failed yesterday. Now it looks good (even though there is another one... [09:14:40] !log Stopping MySQL on db2070 to use it to clone another host - https://phabricator.wikimedia.org/T149553 [09:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:18] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2805155 (10mobrovac) >>! In T149408#2805016, @Nemo_bis wrote: > It would be useful to clarify whether this task is only a... [09:26:10] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2751310 (10Legoktm) >>! In T149408#2767842, @mobrovac wrote: > I guess my real question here is what is the real meaning... [09:26:15] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2805196 (10Joe) >>! In T149408#2805016, @Nemo_bis wrote: > It would be useful to clarify whether this task is only about... 
[09:26:31] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2805197 (10akosiaris) >>! In T150882#2802771, @Cmjohnson wrote: > I find it odd that so many servers are seeing these overheating issues. I agree. FWIW the server' s CPU usage does not explai... [09:27:01] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2805198 (10Legoktm) >>! In T149408#2753327, @Pchelolo wrote: > @bd808 What are the use-cases you have in mind that are no... [09:27:13] (03CR) 10Hashar: [C: 032 V: 032] Symlink fonts for ploticus [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/321560 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [09:28:38] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Volans kafka1022 has a broken disk https://phabricator.wikimedia.org/T151028 [09:28:38] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Volans kafka1022 has a broken disk https://phabricator.wikimedia.org/T151028 [09:28:38] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Volans kafka1022 has a broken disk https://phabricator.wikimedia.org/T151028 [09:28:38] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Volans kafka1022 has a broken disk https://phabricator.wikimedia.org/T151028 [09:28:39] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] Volans kafka1022 has a broken disk 
https://phabricator.wikimedia.org/T151028 [09:30:01] (03PS2) 10Hashar: Drop '.ttf' from $wgTimelineFontFile settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) [09:36:26] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2805237 (10Nemo_bis) >>! In T149408#2805155, @mobrovac wrote: >> Some of the above comments seem to use "small wikis" for... [09:38:38] (03PS21) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [09:41:35] (03PS2) 10Hashar: Move EasyTimeline config to its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321493 (https://phabricator.wikimedia.org/T22825) [09:41:47] (03PS2) 10Hashar: Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) [09:41:52] (03PS3) 10Hashar: Drop '.ttf' from $wgTimelineFontFile settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) [09:42:03] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2805241 (10Nemo_bis) [09:42:24] (03CR) 10jenkins-bot: [V: 04-1] Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [09:52:07] PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:54:07] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK [09:56:07] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:13:19] !log performing schema change on dbstore2001:commonswiki/page (ALGORITHM=COPY) [10:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:39] (03PS22) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [10:19:43] interestingly enough, doing that^ creates lags on other dbstore2001 shards [10:22:12] (03PS23) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [10:24:07] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:24:52] (03PS24) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [10:33:34] (03CR) 10Mobrovac: PDF Render Service: Role and module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: 10Mobrovac) [10:42:11] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2805477 (10Gilles) >>! In T66214#2804824, @Tgr wrote: > Re: type selection, we should use real file extensions (`/2fd4e1c67a2d28fced849ee1bb76e7391b93e... [10:43:40] jynus: I am clear ? [10:44:00] one sec [10:44:51] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2805495 (10Gilles) >>! In T66214#2803304, @GWicke wrote: > However, in order to be able to select such a conversion where needed, clients will either n... 
[10:52:52] matanya, my job seems "stuck" [10:53:17] jynus: i can postpone it to next week, no worries [10:53:52] if it is important to run it now, if you want to be safe, I would wait 6-12 hours [10:54:29] oh wait [10:54:36] it was not frozen [10:54:43] I was only in screen copy mode [10:54:51] it is about to finish [10:54:57] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:56:01] 30 minutes, tops [10:57:43] ok jynus I would love it if legoktm could help us make this process automated [10:58:02] hmm? [10:58:55] legoktm: i am talking about renaming users with more than 50k edits [10:59:13] actually, the right fix [10:59:20] is to not do a rename at all [10:59:27] currently i need to schedule it, annoy a dba, and follow the rename [10:59:44] I do not know if I am being understood [10:59:53] that is hard to sell to the community jynus [10:59:58] I mean changing the schema [11:00:04] so that a rename is 1 click [11:00:10] is that really hard to sell? 
[11:00:14] oh, yeah, that would be great [11:00:29] i thought you meant no renames at all [11:00:40] T33863 [11:00:40] T33863: Fix use of DB schema so RenameUser is trivial - https://phabricator.wikimedia.org/T33863 [11:00:41] it is hard to sell to the devs that would have to change lots of code [11:00:49] :-) [11:01:01] legoktm, yep [11:02:24] that not only would make users happy, DBAs too [11:02:48] 10% reduction on space used, maybe [11:04:38] i'll put it on community wishlist :) [11:05:50] I am going to be bold and say that if that is renormalized and the *links tables and revision are, too, we reduce db space by 50% [11:08:07] (03PS1) 10Legoktm: Set $wgUserEmailUseReplyTo = true; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322243 (https://phabricator.wikimedia.org/T66795) [11:08:58] 06Operations, 10Mail, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 05Security: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#2805575 (10Legoktm) [11:20:01] (03CR) 10Hashar: "Ideally you would want to push the upstream branch to our 'upstream' one then tag their 1.11.0 with upstream/1.11.0." [debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/320991 (owner: 10Gehel) [11:21:13] (03CR) 10MarcoAurelio: "Support & Safety requests that global renamers local group at Meta-Wiki be allowed to use 2fa as well. 
Can those be added to the patch set" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [11:21:57] RECOVERY - puppet last run on mc1032 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:26:38] (03PS1) 10Urbanecm: HD logos for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322247 (https://phabricator.wikimedia.org/T150618) [11:33:24] 06Operations: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045#2805642 (10MoritzMuehlenhoff) [11:35:18] (03PS1) 10Volans: RAID: get megacli status of physical disks too [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) [11:36:29] 06Operations: Fully puppetise yubikey-val - https://phabricator.wikimedia.org/T151046#2805660 (10MoritzMuehlenhoff) [11:36:34] (03PS2) 10Urbanecm: HD logos for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322247 (https://phabricator.wikimedia.org/T150618) [11:39:35] 06Operations: Integrate Yubikey into data.yaml - https://phabricator.wikimedia.org/T151047#2805676 (10MoritzMuehlenhoff) [11:41:57] 06Operations: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048#2805692 (10MoritzMuehlenhoff) [11:44:25] (03PS3) 10Urbanecm: HD logos for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322247 (https://phabricator.wikimedia.org/T150618) [11:45:43] 06Operations: Run systematic availability tests - https://phabricator.wikimedia.org/T151049#2805707 (10MoritzMuehlenhoff) [11:48:00] 06Operations: Proper documentation - https://phabricator.wikimedia.org/T151050#2805722 (10MoritzMuehlenhoff) [11:49:29] (03PS1) 10Zhuyifei1999: Set $wgForeignUploadTargets to [ 'local' ] for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322251 (https://phabricator.wikimedia.org/T139257) [11:50:21] (03PS4) 10Urbanecm: HD logos for multiple 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322247 (https://phabricator.wikimedia.org/T150618) [11:53:55] matanya, you are good to go [11:54:08] sorry, got distracted, too many servers crashing [11:54:14] sadly, literally [11:58:09] (03PS1) 10Ema: cache: get rid of varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) [11:59:16] (03CR) 10jenkins-bot: [V: 04-1] cache: get rid of varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [12:00:20] back in a little while, cat-sitting and food run [12:02:46] (03PS2) 10Ema: cache: get rid of varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) [12:10:02] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.05 seconds [12:11:48] 06Operations, 10Traffic, 13Patch-For-Review: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660#2805778 (10ema) [12:12:36] (03CR) 10Mobrovac: [C: 04-1] Kartotherian: deploy application configuration with scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [12:14:23] (03CR) 10Ema: "pcc output here, one host per cluster/DC: https://puppet-compiler.wmflabs.org/4612/" [puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [12:16:26] (03PS2) 10Volans: RAID: get megacli status of physical disks too [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) [12:34:59] 06Operations, 06Discovery, 10Kartotherian, 06Maps, 03Interactive-Sprint: Deploy libmapnik3.0 deb package to all maps servers - https://phabricator.wikimedia.org/T150722#2805820 (10faidon) [12:45:11] 06Operations, 07Documentation: Proper documentation for Yubico 2FA for production 
use - https://phabricator.wikimedia.org/T151050#2805839 (10Aklapper) [12:46:38] (03PS1) 10Elukey: Move definitions to header files for a better code readability [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322256 [12:46:40] (03PS1) 10Elukey: [WIP] Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) [12:51:13] (03PS2) 10Elukey: [WIP] Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) [12:54:55] jynus: i checked with the team about the drop in row reads. we didn't deploy last week and they can't find any commit that seems relevant. [12:54:58] something in core? [12:58:35] Lydia_WMDE: o/ [12:59:09] hey [13:06:43] (03CR) 10Steinsplitter: [C: 031] "looks OK. (not tested)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322251 (https://phabricator.wikimedia.org/T139257) (owner: 10Zhuyifei1999) [13:30:39] Lydia_WMDE, actually, that would be a good sign [13:30:44] the problem comes [13:31:16] from having 10x more reads on wikidata (?) than on all other wikis together [13:35:42] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=8&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-role=All&from=1479465287244&to=1479476087244 [13:51:53] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T150973#2805997 (10Marostegui) 05Open>03Invalid We are following the latest events of this box here: T150511#2805005 [13:53:08] jynus: whut? 
[13:53:11] mpfh [13:54:34] I do not have anything concrete, but the numbers are so high that maybe you could give it a look [13:54:38] (03PS6) 10Hashar: Only run puppet-lint against HEAD by default [puppet] - 10https://gerrit.wikimedia.org/r/288629 [13:55:20] not saying it is wikidata by itself, could be many things: users' excessive usage, etc. [13:56:39] but at points it reaches 5000 million rows read per second [13:57:11] sorry, no [13:57:20] that is 5 million rows/s [13:58:22] (03CR) 10Hashar: "Added some comment inline to help the review. This change is a follow up of I8302b30a2363139918e815eabccd4ecd4dd4e702 which added puppet-" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [14:02:57] *nod* [14:03:44] (03CR) 10Bartosz Dziewoński: [C: 031] "Thanks, you beat me to it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322251 (https://phabricator.wikimedia.org/T139257) (owner: 10Zhuyifei1999) [14:06:14] (03PS10) 10Hashar: Modification of Rakefile spec entry point [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [14:06:27] jynus: ok emailed the team again (everyone remote today...). let's see [14:08:27] (03CR) 10Jcrespo: [C: 031] "I was actually the person asking for this to be added, I will deploy it." 
[puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [14:09:05] (03Draft2) 10MarcoAurelio: Remove FlaggedRevs autopromotion function at eowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) [14:09:11] (03Draft1) 10MarcoAurelio: Remove FlaggedRevs autopromotion function at eowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) [14:12:11] (03PS3) 10Hashar: Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 [14:16:23] (03CR) 10jenkins-bot: [V: 04-1] Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [14:16:31] (03PS3) 10Rush: toollabs: Fix maintain-kubeusers crashing [puppet] - 10https://gerrit.wikimedia.org/r/322213 (https://phabricator.wikimedia.org/T150946) (owner: 10Yuvipanda) [14:16:33] (03PS3) 10Rush: labstore: Add account_services to secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/322224 (https://phabricator.wikimedia.org/T151014) (owner: 10Yuvipanda) [14:16:56] (03PS4) 10Rush: toollabs: Fix maintain-kubeusers crashing [puppet] - 10https://gerrit.wikimedia.org/r/322213 (https://phabricator.wikimedia.org/T150946) (owner: 10Yuvipanda) [14:17:50] (03CR) 10Rush: [C: 032 V: 032] toollabs: Fix maintain-kubeusers crashing [puppet] - 10https://gerrit.wikimedia.org/r/322213 (https://phabricator.wikimedia.org/T150946) (owner: 10Yuvipanda) [14:18:44] (03PS2) 10Jcrespo: Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [14:19:09] (03PS4) 10Rush: labstore: Add account_services to secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/322224 (https://phabricator.wikimedia.org/T151014) (owner: 10Yuvipanda) [14:19:33] !log phabricator: deploying more fixes from upstream/stable to wmf/stable. 
Fixes T150992 [14:19:36] (03CR) 10Rush: [C: 032 V: 032] labstore: Add account_services to secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/322224 (https://phabricator.wikimedia.org/T151014) (owner: 10Yuvipanda) [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:58] T150992: Error creating Milestone: Argument passed must be array, string given, called in /PhabricatorPHIDListEditField.php:58 - https://phabricator.wikimedia.org/T150992 [14:22:10] (03PS3) 10Elukey: Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) [14:22:23] (03PS3) 10Jcrespo: Stagger parser cache purges to avoid lag [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [14:23:11] (03CR) 10Hashar: [C: 04-1] "wip" [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [14:23:30] (03PS3) 10MarcoAurelio: Remove FlaggedRevs autopromotion function at eowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) [14:24:50] (03PS4) 10Elukey: Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) [14:27:27] (03PS2) 10Elukey: Move definitions to header files for a better code readability [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322256 (https://phabricator.wikimedia.org/T147440) [14:27:29] (03PS5) 10Elukey: Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) [14:30:58] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [14:31:32] jynus ^ can that 
be related to the ticket we worked on? [14:31:52] it could be, chase merged [14:31:58] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [14:32:34] it's in progress jynus & marostegui there may be more changes to convert to new setup, I'm silencing as we speak but no worries atm [14:32:43] Ah, cool thanks [14:33:58] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - create-dbusers is active [14:38:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - create-dbusers is active [14:39:38] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2806082 (10grin) Thanks for the reminder, I've got a word back from MQ, and they said, that in 2014 MapQuest served **380 million Open Tiles per day**, 9.3 million Op... 
[14:39:57] marostegui: race condition in the module where it tried to start the service before installing the python ldap modules it seems :) [14:49:08] (03PS1) 10Rush: labstore: dependency issues with account_services [puppet] - 10https://gerrit.wikimedia.org/r/322264 [14:49:30] (03PS2) 10Rush: labstore: dependency issues with account_services [puppet] - 10https://gerrit.wikimedia.org/r/322264 [15:20:07] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring: Rename a few variables [puppet] - 10https://gerrit.wikimedia.org/r/322107 (owner: 10Alexandros Kosiaris) [15:20:11] (03PS2) 10Alexandros Kosiaris: monitoring: Rename a few variables [puppet] - 10https://gerrit.wikimedia.org/r/322107 [15:20:14] (03CR) 10Alexandros Kosiaris: [V: 032] monitoring: Rename a few variables [puppet] - 10https://gerrit.wikimedia.org/r/322107 (owner: 10Alexandros Kosiaris) [15:23:01] (03CR) 10Volans: [C: 04-1] "I'm working on a further improvement" [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) (owner: 10Volans) [15:23:19] dapatrick: you in security? can I pm? 
[15:25:43] mafk: what's up/ [15:26:02] Reedy: it's the "issue" again [15:26:15] PM me if you want [15:26:18] ok [15:26:45] (03PS3) 10Rush: labstore: dependency issues with account_services [puppet] - 10https://gerrit.wikimedia.org/r/322264 [15:27:55] (03CR) 10Rush: [C: 032] labstore: dependency issues with account_services [puppet] - 10https://gerrit.wikimedia.org/r/322264 (owner: 10Rush) [15:29:56] (03PS1) 10Elukey: Force a 404 on each HTTP request landing to a non configured domain [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) [15:31:55] (03CR) 10Elukey: [C: 04-1] "Testing needed, will try it in deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [15:35:42] (03PS1) 10Hashar: nodepool: bump max server from 12 to 20 [puppet] - 10https://gerrit.wikimedia.org/r/322270 (https://phabricator.wikimedia.org/T133911) [15:43:39] (03PS1) 10Giuseppe Lavagetto: Initial debianization [calico-cni] - 10https://gerrit.wikimedia.org/r/322273 [15:46:36] 06Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Traffic, 13Patch-For-Review: piuparts fail with WARN: Broken symlinks: /etc/systemd/system... - https://phabricator.wikimedia.org/T141454#2806271 (10hashar) 05Open>03Resolved a:03hashar Not really solved, but piuparts is no more... [15:47:32] 06Operations, 06Discovery, 10Kartotherian, 06Maps, 03Interactive-Sprint: Deploy libmapnik3.0 deb package to all maps servers - https://phabricator.wikimedia.org/T150722#2806276 (10Gehel) Thanks @faidon ! I'll add that package soon and we'll see about testing the node 6 upgrade... [15:48:30] (03CR) 10BBlack: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/322252 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [15:56:01] (03PS1) 10Gehel: maps / kartotherian: libmapnik3.0 is required for the upgrade to nodejs 6 [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) [15:56:08] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:28] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2806307 (10Cmjohnson) Replaced the PSU, return shipment tracking is 1ZW0948Y9081215654 [15:58:27] (03CR) 10Gehel: "@yurik: it looks to me that kartotherian requires libmapnik3.0, but not tilerator. Can you confirm?" [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) (owner: 10Gehel) [15:59:53] (03PS1) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provide (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150527) [16:00:59] (03CR) 10jenkins-bot: [V: 04-1] Make systemd-timesyncd available as an alternative time synchronisation provide (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150527) (owner: 10Muehlenhoff) [16:01:13] (03PS2) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150527) [16:02:13] (03PS3) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150527) [16:03:11] 06Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown: Image tarball dumps on your.org are not being generated - 
https://phabricator.wikimedia.org/T53001#2806332 (10Aklapper) [16:04:06] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2806333 (10Anomie) >>! In T149408#2805016, @Nemo_bis wrote: > It would be useful to clarify whether this task is only abo... [16:05:42] RECOVERY - Host labsdb1009 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:06:54] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, 03Interactive-Sprint: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2806341 (10Gehel) [16:08:29] (03PS1) 10Gilles: Don't send client caching headers for successful thumbnails in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/322280 (https://phabricator.wikimedia.org/T150642) [16:08:59] moritzm: sure that you hit the right phab task at https://gerrit.wikimedia.org/r/#/c/322279/ ? [16:12:02] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:05] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2806370 (10jcrespo) Sadly, it still doesn't allow to boot from the disk device, and when going to to the hp raid configuration utility it says: ``` error: no such device: HPEZC... 
[16:12:30] (03PS9) 10Paladox: contint: add postgresql to contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [16:16:37] (03PS1) 10Gehel: granting access to analytics-privatedata-users for user discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/322282 (https://phabricator.wikimedia.org/T151063) [16:18:21] (03PS1) 10Andrew Bogott: Wikistatus: [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [16:18:28] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should know about the lossless thumbnail parameter - https://phabricator.wikimedia.org/T150758#2806384 (10Gilles) [16:18:38] Sagan: oh, there's been a typo, will fix the task number, thanks! [16:18:48] np :) [16:19:09] (03PS2) 10Andrew Bogott: Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [16:19:17] (03PS4) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) [16:19:19] (03PS3) 10Andrew Bogott: Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [16:21:37] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should know about the lossless thumbnail parameter - https://phabricator.wikimedia.org/T150758#2806408 (10Gilles) [16:25:12] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:30:39] (03PS10) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [16:34:56] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - 
https://phabricator.wikimedia.org/T151063#2806341 (10Krenair) You mean it's a system user rather than a human account that can be SSHd into? I'd be s... [16:38:26] 06Operations, 06Performance-Team, 10Thumbor: Investigate whether we need a repeat failure guard and/or a poolcounter-like behavior in Thumbor - https://phabricator.wikimedia.org/T150745#2795243 (10Gilles) @krinkle reminded me of the (presumably per-IP) rate limiter for uncached thumbnails. I need to figure o... [16:40:27] RECOVERY - swift-object-updater on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:40:27] RECOVERY - swift-container-updater on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:40:27] RECOVERY - swift-container-server on ms-be1027 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:40:27] RECOVERY - swift-container-auditor on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:40:37] RECOVERY - swift-object-replicator on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:40:37] RECOVERY - swift-object-auditor on ms-be1027 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:40:37] RECOVERY - swift-object-server on ms-be1027 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:40:37] RECOVERY - swift-account-auditor on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:40:37] RECOVERY - swift-account-reaper on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:40:47] RECOVERY - swift-container-replicator on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:40:47] RECOVERY 
- Disk space on ms-be1027 is OK: DISK OK [16:40:47] RECOVERY - swift-account-replicator on ms-be1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:41:07] RECOVERY - swift-account-server on ms-be1027 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:41:07] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:41:27] 06Operations, 06Performance-Team, 10Thumbor: Reimplement various rate-limiting mechanisms in Thumbor - https://phabricator.wikimedia.org/T150745#2806445 (10Gilles) [16:41:59] 06Operations, 06Performance-Team, 10Thumbor: Implement DC-local cache failure limiter - https://phabricator.wikimedia.org/T151065#2806447 (10Gilles) [16:42:16] 06Operations, 06Performance-Team, 10Thumbor: Implement PoolCounter support - https://phabricator.wikimedia.org/T151066#2806461 (10Gilles) [16:42:34] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter - https://phabricator.wikimedia.org/T151067#2806475 (10Gilles) [16:43:13] 06Operations, 06Performance-Team, 10Thumbor: Investigate differences in status codes between thumbor and image scalers - https://phabricator.wikimedia.org/T150641#2806491 (10Gilles) [16:43:15] 06Operations, 06Performance-Team, 10Thumbor: Reimplement various rate-limiting mechanisms in Thumbor - https://phabricator.wikimedia.org/T150745#2795243 (10Gilles) 05Open>03Resolved Closing this parent task, the child tasks are enough. 
[16:46:07] (03PS4) 10Andrew Bogott: Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [16:55:06] (03PS5) 10Andrew Bogott: Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [17:01:10] (03PS6) 10Andrew Bogott: Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [17:01:14] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't render a few SVGs that Mediawiki can - https://phabricator.wikimedia.org/T150754#2806516 (10Gilles) [17:02:37] (03CR) 10BryanDavis: [C: 031] Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) (owner: 10Andrew Bogott) [17:03:13] (03PS7) 10Andrew Bogott: Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) [17:05:39] (03CR) 10Andrew Bogott: [C: 032] Wikistatus: fix more races with mwclient [puppet] - 10https://gerrit.wikimedia.org/r/322283 (https://phabricator.wikimedia.org/T139773) (owner: 10Andrew Bogott) [17:10:57] (03PS1) 10Alexandros Kosiaris: icinga: Fix the typo in check_systemd_state [puppet] - 10https://gerrit.wikimedia.org/r/322288 [17:15:27] (03CR) 10Jcrespo: "I intended to deploy this, I got distracted. Now it is too late. 
This can go any time, but I do not want to merge it and go away for the w" [puppet] - 10https://gerrit.wikimedia.org/r/320928 (https://phabricator.wikimedia.org/T150124) (owner: 10Aaron Schulz) [17:23:41] 06Operations, 06Performance-Team, 10Thumbor: Improve Content-Disposition - https://phabricator.wikimedia.org/T151072#2806593 (10Gilles) [17:26:44] (03PS2) 10Elukey: Force a 404 on each HTTP request landing to a non configured domain [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) [17:27:59] (03PS1) 10Ori.livneh: Revert "Don't use AbuseFilterCachingParser on bgwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322289 (https://phabricator.wikimedia.org/T148660) [17:28:53] (03PS3) 10Elukey: Force a 404 on each HTTP request landing to a non configured domain [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) [17:30:12] (03CR) 10Ori.livneh: [C: 032] Revert "Don't use AbuseFilterCachingParser on bgwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322289 (https://phabricator.wikimedia.org/T148660) (owner: 10Ori.livneh) [17:30:32] (03PS1) 10Andrew Bogott: Wikistatus: try deleting pages on delete.end rather than delete.start [puppet] - 10https://gerrit.wikimedia.org/r/322290 [17:31:13] (03Merged) 10jenkins-bot: Revert "Don't use AbuseFilterCachingParser on bgwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322289 (https://phabricator.wikimedia.org/T148660) (owner: 10Ori.livneh) [17:32:34] (03CR) 10Andrew Bogott: [C: 032] Wikistatus: try deleting pages on delete.end rather than delete.start [puppet] - 10https://gerrit.wikimedia.org/r/322290 (owner: 10Andrew Bogott) [17:34:14] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: If3b80b1a: Revert "Don't use AbuseFilterCachingParser on bgwiki" (T148660) (duration: 00m 50s) [17:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:37] T148660: Stack overflow in 
AbuseFilter when using AbuseFilterCachingParser - https://phabricator.wikimedia.org/T148660 [17:34:48] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2778014 (10fgiunchedi) I was confused by that message too @jcrespo, though it is sufficient to wait for the underlying linux to fully boot. You'll be dropped into `hpssacli` aft... [17:48:53] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2806644 (10jcrespo) Thanks, @fgiunchedi, but I have not advanced much: ``` => controller slot=0 rescan => controller slot=1 pd all show status Error: The specifie... [17:51:09] (03PS3) 10Gehel: Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) [17:59:06] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2806649 (10jcrespo) Compare with the equivalent, well-working, labsdb1010: ``` => controller slot=1 pd all show status physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 1600.3 GB... [18:00:30] 06Operations, 06Performance-Team, 10Thumbor: Improve Content-Disposition - https://phabricator.wikimedia.org/T151072#2806650 (10Gilles) a:05fgiunchedi>03Gilles [18:03:19] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2806654 (10GWicke) >>! In T66214#2805495, @Gilles wrote: >>>! In T66214#2803304, @GWicke wrote: >> However, in order to be able to select such a conver... [18:10:09] ori you here? [18:10:21] not really [18:10:29] is it urgent? [18:11:03] ori: no, can i drop you a mail? 
[18:11:24] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2806341 (10Dzahn) Indeed, an access request for a system user rather than a human is unusual. But that does... [18:11:45] sure, you have my e-mail? [18:11:54] no [18:12:07] ori.livneh@gmail.com [18:12:18] thx [18:15:05] bd808: ? [18:15:41] Zppix: is there a reason for the ping? [18:16:06] bd808: that jouncebot change [18:16:10] Is it good? [18:16:53] (03CR) 10Dzahn: [C: 031] "This very much looks like the solution to what i once described with too many words on T137176 and almost forgot again already. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [18:16:58] I haven't had time to take a look yet, sorry. I may get to it today but sometime over the weekend is more likely. [18:17:04] (03PS3) 10Volans: RAID: get RAID status improvement for MegaCLI [puppet] - 10https://gerrit.wikimedia.org/r/322249 (https://phabricator.wikimedia.org/T151043) [18:17:17] Ok [18:18:28] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_statsv_gmond_pyconf] [18:18:51] (03CR) 10Dzahn: "@BBlack you once said to me how we should return a 404 on these, that's where that ticket came from and what Elukey is fixing now. 
Does it" [puppet] - 10https://gerrit.wikimedia.org/r/322268 (https://phabricator.wikimedia.org/T137176) (owner: 10Elukey) [18:20:54] 06Operations, 06Performance-Team, 10Thumbor: Implement PoolCounter support - https://phabricator.wikimedia.org/T151066#2806698 (10Gilles) p:05High>03Normal [18:22:51] !log removing mysql-test dir from silver to free up some space there [18:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:18] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:24:20] 06Operations, 10EventBus, 06Services (watching): eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524#2806703 (10Krinkle) [18:27:00] !log deployed unix_auth on silver (labswiki) T150446 [18:27:22] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review, 15User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2806710 (10Krinkle) [18:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:39] 06Operations, 10Monitoring: Allow customizing the alert message from graphite - https://phabricator.wikimedia.org/T95801#2806711 (10Krinkle) [18:28:02] (03PS1) 10Andrew Bogott: wikistatus: rearrange page-deletion logic [puppet] - 10https://gerrit.wikimedia.org/r/322296 [18:28:45] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labmon1001 graphite instance archiver keeps archiving the same instances - https://phabricator.wikimedia.org/T120377#2806718 (10Krinkle) [18:29:55] (03CR) 10Andrew Bogott: [C: 032] wikistatus: rearrange page-deletion logic [puppet] - 10https://gerrit.wikimedia.org/r/322296 (owner: 10Andrew Bogott) [18:31:58] (03CR) 10Florianschmidtwelzow: [C: 031] Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 10Paladox) [18:33:12] 06Operations, 10Continuous-Integration-Infrastructure, 
07Upstream, 07Zuul: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#2806741 (10Krinkle) [18:33:22] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked), 07Wikimedia-Incident: setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2806742 (10RobH) [18:33:45] (03CR) 10Dzahn: [C: 031] "looks good now, thanks. except it's not included in any other class, so nothing would happen until you do that. other package classes are " [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [18:34:07] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked), 07Wikimedia-Incident: setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2806761 (10RobH) [18:35:07] (03CR) 10Paladox: "Yep, it is so that we can get the class created and include the class in a separate patch so releng can approve or decline that patch." [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [18:35:10] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked), 07Wikimedia-Incident: setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2806742 (10RobH) [18:35:41] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2806765 (10RobH) a:05Eevans>03RobH [18:36:31] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331203 (10RobH) [18:36:33] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2806767 (10RobH) 05Open>03Resolved The setup task of these systems in their reclaimed role has been setup. 
I had this task open as a reminder until I completed that, s... [18:38:10] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2806772 (10RobH) [18:40:08] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:40:46] (03PS7) 10Dzahn: Only run puppet-lint against HEAD by default [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [18:43:12] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps, and 2 others: Requesting access to analytics-privatedata-users for technical user discovery-stats - https://phabricator.wikimedia.org/T151063#2806792 (10Gehel) @Dzahn : yes that's a good understanding of what it does. It only publishes very general... [18:43:26] (03CR) 10Dzahn: [C: 032] Only run puppet-lint against HEAD by default [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [18:46:24] (03PS4) 10Chad: Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 10Paladox) [18:47:28] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:47:41] 06Operations, 06Performance-Team, 10Traffic, 07Regression: Investigate major HTTP 500 spike since 2016-09-23 - https://phabricator.wikimedia.org/T151078#2806815 (10Krinkle) [18:48:11] (03CR) 10Chad: [C: 032] Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 10Paladox) [18:48:27] (03CR) 10Gehel: [C: 032] granting access to analytics-privatedata-users for user discovery-stat [puppet] - 10https://gerrit.wikimedia.org/r/322282 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [18:48:42] (03Merged) 10jenkins-bot: Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 
10Paladox) [18:48:55] (03CR) 10Paladox: "Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 10Paladox) [18:49:04] (03CR) 10Gehel: [C: 04-1] "This require validation of T151063 and some discussion before merging" [puppet] - 10https://gerrit.wikimedia.org/r/322282 (https://phabricator.wikimedia.org/T151063) (owner: 10Gehel) [18:51:18] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:52:25] !log demon@tin Synchronized wmf-config/CommonSettings.php: extdist settings for 1.28 (duration: 00m 49s) [18:52:31] (03PS1) 10Dzahn: site.pp/bastions: make includes/roles more readable [puppet] - 10https://gerrit.wikimedia.org/r/322297 [18:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:17] 06Operations, 06Performance-Team, 10Traffic, 07Regression: Investigate major HTTP 500 spike since 2016-09-23 - https://phabricator.wikimedia.org/T151078#2806831 (10BBlack) Pretty sure what you're looking at here is T147648 (also related: T147784) [18:54:27] (03PS1) 10Rush: labsdb: update create-dbusers to run on new cluster [puppet] - 10https://gerrit.wikimedia.org/r/322298 [18:54:43] (03CR) 10jenkins-bot: [V: 04-1] labsdb: update create-dbusers to run on new cluster [puppet] - 10https://gerrit.wikimedia.org/r/322298 (owner: 10Rush) [18:56:02] (03PS2) 10Rush: labsdb: update create-dbusers to run on new cluster [puppet] - 10https://gerrit.wikimedia.org/r/322298 [18:58:05] (03PS3) 10Rush: labsdb: update create-dbusers to run on new cluster [puppet] - 10https://gerrit.wikimedia.org/r/322298 [18:59:53] (03CR) 10Rush: [C: 032] labsdb: update create-dbusers to run on new cluster [puppet] - 10https://gerrit.wikimedia.org/r/322298 (owner: 10Rush) [19:08:34] (03PS2) 10Dzahn: site.pp/bastions: make includes/roles more readable [puppet] - 10https://gerrit.wikimedia.org/r/322297 [19:09:11] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last 
run 27 seconds ago with 0 failures [19:10:58] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2806890 (10RobH) We've never added a GPU/Card to any of our servers in the past. Many of them may not have the space for them. My understanding is f... [19:12:34] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2806898 (10RobH) Alternatively Chris may be able to check any R430 for the space rather than downtime stat1004, as long as its a same generation R430. [19:16:05] 06Operations, 10ops-eqiad: check stat1004 (or another identical R430) for PCIe expansion space - https://phabricator.wikimedia.org/T151080#2806905 (10RobH) [19:16:31] godog: fyi, hosts that have ganglia::aggregator (bast3001/bast4001) always fail in puppet compiler with some "Error: Failed to parse template prometheus/cluster_config.erb:".. just saying because i see the word prometheus there [19:17:29] and only in compiler [19:19:33] 06Operations, 10ops-eqiad: check stat1004 (or another identical R430) for PCIe expansion space - https://phabricator.wikimedia.org/T151080#2806930 (10RobH) Additionally check the power supply output of these systems, as this card requires quite a bit: Thermal and Power Specs: Maximum GPU Temperature (in C): 9... [19:19:40] (03CR) 10Dzahn: [C: 032] "this was also to test that puppet-lint/jenkins still work after i merged https://gerrit.wikimedia.org/r/#/c/288629/" [puppet] - 10https://gerrit.wikimedia.org/r/322297 (owner: 10Dzahn) [19:21:44] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2806933 (10RobH) a:03RobH I'll steal this task for followup from Chris's findings and will update it accordingly. 
[19:22:41] (03CR) 10Dzahn: "tested with https://gerrit.wikimedia.org/r/#/c/322297/ that puppet-lint checks still work" [puppet] - 10https://gerrit.wikimedia.org/r/288629 (owner: 10Hashar) [19:25:07] (03PS11) 10Dzahn: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:30:28] (03CR) 10Dzahn: "can you add a short description what they are for to the code in a comment? like "for PHPUNit tests" and the 2 bug numbers? then i'll merg" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:30:41] 06Operations: require_package fails to install packages if "++" appears in package name - https://phabricator.wikimedia.org/T125276#2806966 (10ori) 05Open>03Resolved a:05ori>03hashar Fixed by @hashar in 9c4a7e3e7d. (Thanks!) [19:31:08] 06Operations, 13Patch-For-Review: Staging area for the next version of the transparency report - https://phabricator.wikimedia.org/T138197#2806975 (10ori) 05Open>03Resolved Nope. [19:32:57] (03PS1) 10Dzahn: Revert "visualdiff: do not install g++ package, debugging" [puppet] - 10https://gerrit.wikimedia.org/r/322302 [19:33:13] (03CR) 10jenkins-bot: [V: 04-1] Revert "visualdiff: do not install g++ package, debugging" [puppet] - 10https://gerrit.wikimedia.org/r/322302 (owner: 10Dzahn) [19:34:57] 06Operations: require_package fails to install packages if "++" appears in package name - https://phabricator.wikimedia.org/T125276#2806996 (10Dzahn) nice :) thanks! reverting https://gerrit.wikimedia.org/r/#/c/322302/ [19:35:08] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2806997 (10Cmjohnson) The size of the card it too large for the space inside the server. 
[19:35:26] (03PS12) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:38:31] (03PS13) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:39:06] (03PS14) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:39:38] (03PS15) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:40:25] (03PS16) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:41:18] (03PS17) 10Paladox: contint: add postgresql to contint::packages::postgresql [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [19:42:39] (03CR) 10Dzahn: [C: 032] "class is not included anywhere yet, on purpose, but at some point it will be needed, so going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:42:52] mutante: thanks! do you have a link to the compiler? [19:42:53] ^^ thankyou mutante [19:43:18] godog: http://puppet-compiler.wmflabs.org/4613/ [19:44:06] mutante: thanks, my guess would be that's related to puppetdb [19:44:55] godog: *nod* [19:53:11] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2807048 (10Nuria) @Cmjohnson : will it fit in 1002? 
[20:03:09] (03PS2) 10Dzahn: Revert "visualdiff: do not install g++ package, debugging" [puppet] - 10https://gerrit.wikimedia.org/r/322302 [20:03:16] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2807065 (10RobH) stat1002 is an R510 3 PCIe G2 slots + 1 storage slot: One x8 slot Two x4 slots One Storage x4 slot So its appea... [20:05:00] (03PS1) 10Rush: labsdb: create-dbusers interval handling [puppet] - 10https://gerrit.wikimedia.org/r/322307 [20:05:12] (03PS3) 10Dzahn: Revert "visualdiff: do not install g++ package, debugging" [puppet] - 10https://gerrit.wikimedia.org/r/322302 [20:05:29] (03PS4) 10Dzahn: Revert "visualdiff: do not install g++ package, debugging" [puppet] - 10https://gerrit.wikimedia.org/r/322302 [20:06:02] (03PS2) 10Rush: labsdb: create-dbusers interval handling [puppet] - 10https://gerrit.wikimedia.org/r/322307 [20:10:15] (03CR) 10Dzahn: [C: 032] Revert "visualdiff: do not install g++ package, debugging" [puppet] - 10https://gerrit.wikimedia.org/r/322302 (owner: 10Dzahn) [20:10:22] 06Operations, 06Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2730983 (10chasemp) No objection here, {T146150} already existed. [20:10:26] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2807089 (10RobH) In fact, it seems that even if there is room in the R510, it lacks the power cables for PCI slot use. Some quick checking online fin... 
[20:10:43] 06Operations, 06Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2807092 (10chasemp) [20:12:12] (03CR) 10Rush: [C: 032] labsdb: create-dbusers interval handling [puppet] - 10https://gerrit.wikimedia.org/r/322307 (owner: 10Rush) [20:12:16] (03PS3) 10Rush: labsdb: create-dbusers interval handling [puppet] - 10https://gerrit.wikimedia.org/r/322307 [20:12:39] (03CR) 10Rush: [V: 032] labsdb: create-dbusers interval handling [puppet] - 10https://gerrit.wikimedia.org/r/322307 (owner: 10Rush) [20:13:07] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2807097 (10RobH) a:05RobH>03DarTar I'm assigning this back to @dartar as the original requestor, since it seems we cannot accommodate the request... [20:21:28] deploying mobileapps in a few minutes absent any objectsion [20:21:32] *objections [20:21:44] mdholloway: uh, why on a Friday? [20:21:52] mdholloway: default objection. [20:22:52] greg-g: just want to deploy a small change to a new "announcements" endpoint so that it can be tested live in the apps next week [20:22:57] no [20:23:00] sorry [20:23:06] Do it monday. [20:23:14] mdholloway: monday is fine [20:23:27] greg-g: just wasn't sure if we could get anything in on monday [20:23:30] guys, you know there are no deploys on Fridays. [20:23:42] greg-g: because of the holiday [20:23:58] I'm going afk, my laptop is dying and I just finished a keynote, /me goes [20:24:13] greg-g: yeah, i was just thinking all deployments were verboten next week. [20:24:39] but monday is fine with me if it's fine with you. [20:24:41] mdholloway: send me an email describing what it is and I'll reply there [20:25:14] coreyfloyd: you mind doing the honors on that? 
[20:25:38] mdholloway: sure
[20:26:35] coreyfloyd: thx
[20:30:54] Operations, ops-codfw, DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2807112 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Yo...
[20:41:00] (CR) Hashar: "Thanks :]" [puppet] - https://gerrit.wikimedia.org/r/288629 (owner: Hashar)
[20:44:57] Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2807157 (chasemp) @Kelson we did a capacity audit and spoke today about what is possible. We are going to create a custom flavor for the wmoffliner project that is an XL VM with an...
[20:45:17] (PS1) Eevans: bootstrap restbase2010-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322310 (https://phabricator.wikimedia.org/T151086)
[20:54:03] (CR) Filippo Giunchedi: [C: 2] bootstrap restbase2010-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/322310 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[20:56:24] godog: gracias
[20:57:09] urandom: yw! seems to be working
[20:57:19] godog: but of course!
[20:57:25] godog: when does it not? :)
[20:57:39] * urandom knocks on wood, furiously
[20:57:40] haha when I jinx it like now :(
[21:00:12] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused
[21:01:22] ^^^ got it
[21:01:49] yup, I'm going to lunch but ping me if needed
[21:02:25] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused eevans Bootstrapping
[21:02:42] kk
[21:06:32] Operations, ops-eqiad, Cassandra, Services (blocked), Wikimedia-Incident: setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2807214 (RobH) The SSDs in restbase are typically placed into a large raid0. Please copy the partitioning of restbase-test200* for these.
[21:21:02] Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2807242 (Andrew) stalled>Resolved I've adjusted quotas to allow for creation of two xlarge instances, and added a special flavor called xlarge-xtradisk for your EN instance.
[21:25:17] (Draft1) Paladox: Phabricator: fix Class[Exim4] is already declared error [puppet] - https://gerrit.wikimedia.org/r/322351
[21:25:19] (Draft2) Paladox: Phabricator: fix Class[Exim4] is already declared error [puppet] - https://gerrit.wikimedia.org/r/322351
[21:27:03] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[21:27:56] (CR) Andrew Bogott: [C: 1] "f-f-f-format!" [puppet] - https://gerrit.wikimedia.org/r/322149 (owner: Rush)
[21:28:06] (CR) Chad: [C: 1] Phabricator: fix Class[Exim4] is already declared error [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[21:31:02] Operations, Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2807262 (faidon) Ah! Sorry for missing that! @chasemp, is that something that the #Labs team can and/or will do?
[21:32:06] !log CI / Zuul slightly overloaded. Will resolve by itself soon.
[21:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:49] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:40:01] (PS1) BBlack: New GlobalSign unified, 2016-11-21 start date [puppet] - https://gerrit.wikimedia.org/r/322353
[21:46:13] (CR) 20after4: [C: 1] Phabricator: fix Class[Exim4] is already declared error [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[21:46:27] Operations, Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2807288 (chasemp) sure yeah, it's been hellfire and brimstone for a bit here recently. Post-thanksgiving I expect? We are still untangling knots from (stage 1) of the storage migration, madhu is gone for a month and e...
[21:48:48] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[21:55:13] Operations, Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2807298 (faidon) Sure, I'm just making sure that we're not waiting for each other :)
[21:56:05] (PS2) BBlack: New GlobalSign unified, 2016-11-21 start date [puppet] - https://gerrit.wikimedia.org/r/322353 (https://phabricator.wikimedia.org/T149858)
[21:57:42] (CR) BBlack: [C: 2] "Note this commit does not deploy them for use. Manually verified only so far." [puppet] - https://gerrit.wikimedia.org/r/322353 (https://phabricator.wikimedia.org/T149858) (owner: BBlack)
[22:01:17] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:08:57] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[22:23:16] (CR) Dzahn: [C: -1] "This works in production but fails in labs when the hiera setting "standard::has_default_mail_relay" is missing or set to "true"." [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[22:30:17] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[22:31:38] (CR) Dzahn: "let's put it in hieradata/labs/phabricator/ in repo, rather than the wiki page" [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[22:32:09] (PS1) Paladox: Add standard::has_default_mail_relay to phabricator hieradata labs class [puppet] - https://gerrit.wikimedia.org/r/322357
[22:32:14] mutante ^^
[22:32:14] :)
[22:32:29] (PS2) Paladox: Add standard::has_default_mail_relay to phabricator hieradata labs class [puppet] - https://gerrit.wikimedia.org/r/322357
[22:34:33] (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/322357/ should be done instead yet, yea" [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[22:35:03] (Abandoned) Paladox: Phabricator: fix Class[Exim4] is already declared error [puppet] - https://gerrit.wikimedia.org/r/322351 (owner: Paladox)
[22:36:08] (CR) Dzahn: [C: 1] "yes, this should be the proper solution to fix the issue from https://gerrit.wikimedia.org/r/#/c/322351/." [puppet] - https://gerrit.wikimedia.org/r/322357 (owner: Paladox)
[22:40:17] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:49:40] (PS3) Paladox: Phabricator: Add standard::has_default_mail_relay to phabricator hieradata labs class [puppet] - https://gerrit.wikimedia.org/r/322357
[22:50:35] (PS4) Paladox: Phabricator: Set this standard::has_default_mail_relay to false in phabricator hieradata labs class [puppet] - https://gerrit.wikimedia.org/r/322357
[22:55:01] (CR) Dzahn: [C: 2] "yes, it's set to false in prod and Paladox already set it in the Hiera: labs page. Let's add it here permanently." [puppet] - https://gerrit.wikimedia.org/r/322357 (owner: Paladox)
[22:55:18] mutante ^^ thanks :)
[23:08:17] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[23:13:21] (PS1) Volans: Puppet merge: molly-guard multiple commits [puppet] - https://gerrit.wikimedia.org/r/322362
[23:36:46] Operations, Domains, Phabricator, Traffic: short URL for phabricator - https://phabricator.wikimedia.org/T151094#2807491 (Paladox)
[23:39:17] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:40:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:41:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:43:17] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:44:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:47:17] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:48:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:56:46] (CR) Aaron Schulz: Remove FlaggedRevs autopromotion function at eowiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/322262 (https://phabricator.wikimedia.org/T150591) (owner: MarcoAurelio)
[23:58:04] (PS1) Filippo Giunchedi: Move hhvm_exporter to its own package [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/322371
[23:58:06] (PS1) Filippo Giunchedi: Debian packaging [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/322372