[00:01:16] (03PS5) 10Zhuyifei1999: Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) [00:01:37] !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.2 [00:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:59] (03CR) 10jerkins-bot: [V: 04-1] Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [00:03:52] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4179231 (101233thehongkonger) Well, it seems that not only accounts that are based in enwp being brute-forced, but also in zhwp. [00:05:21] (03CR) 10jenkins-bot: Various pylint fixes to scap plugins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430825 (owner: 10Chad) [00:05:25] (03CR) 10jenkins-bot: Scap plugins: Add __init__.py so python treats this as a package [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430823 (owner: 10Chad) [00:05:31] (03CR) 10jenkins-bot: group1 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430827 (owner: 10Chad) [00:17:25] (03CR) 10BryanDavis: [C: 031] mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [00:42:03] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2 - Objective 2: Set up a continuous integration and deployment pipeline - https://phabricator.wikimedia.org/T170481#4180796 (10dduvall) [00:45:22] no_justification: Are you done? [00:45:27] I have a patch to SWAT. [00:45:33] Technically no [00:45:49] But go ahead [00:46:09] Okay. [00:47:09] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/430829 (https://phabricator.wikimedia.org/T193762) (owner: 10Niharika29) [00:48:35] (03Merged) 10jenkins-bot: Up the config temporarily to prevent loginnotify fail attempt emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430829 (https://phabricator.wikimedia.org/T193762) (owner: 10Niharika29) [00:48:44] AndyRussG: Still want your patch SWATted? [00:49:43] (03CR) 10jenkins-bot: Up the config temporarily to prevent loginnotify fail attempt emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430829 (https://phabricator.wikimedia.org/T193762) (owner: 10Niharika29) [00:50:49] Niharika: hi! yeah that'd be great :) [00:50:54] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Temporarily add a higher threshold to trigger login attempt notices T193762 (duration: 01m 17s) [00:50:56] it's incredibly safe [00:50:58] (03PS2) 10Niharika29: Turn off CentralNotice EventLogging impression data following test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430649 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [00:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:02] AndyRussG: On it. [00:51:08] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430649 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [00:51:28] Niharika: thanks! :) [00:52:54] (03Merged) 10jenkins-bot: Turn off CentralNotice EventLogging impression data following test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430649 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [00:55:09] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Temporary low-level activation of Eventlogging impression data for testing T183978 (duration: 01m 16s) [00:55:12] AndyRussG: It's live now. 
[00:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:13] T183978: [Epic] Kafkatee changes - https://phabricator.wikimedia.org/T183978 [00:55:15] no_justification: I'm done. [00:55:40] (03CR) 10jenkins-bot: Turn off CentralNotice EventLogging impression data following test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430649 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [00:56:45] Niharika: okok checking [00:57:51] Niharika: lgtm! :) [00:58:01] Cool. :) [00:58:19] thanks much :) [02:57:14] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4179231 (10Lnnocentius) >>! In T193769#4180748, @1233thehongkonger wrote: > Well, it seems that not only accounts that are based in enwp being brute-forced, but also in zhwp. Just a r... [03:05:58] (03PS1) 10Chad: scap patch: Some minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430842 [03:08:38] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4180953 (10Xaosflux) Graph of authentication - showing it is still occuring: https://grafana.wikimedia.org/dashboard/db/authentication-metrics?orgId=1&from=1525275869702&to=now&theme=d... [03:12:54] (03PS1) 10Chad: All kinds of py cleanup to swat.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430843 [03:14:00] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4180966 (10EtaoinWu) Since the crack started, the CAPTCHA error rate was high. However, at about 5/3 18:30 UTC, the CAPTCHA error rate suddenly falls (from almost 100% to a normal rate... 
[03:16:25] (03CR) 10Chad: [C: 032] All kinds of py cleanup to swat.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430843 (owner: 10Chad) [03:16:27] (03CR) 10Chad: [C: 032] scap patch: Some minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430842 (owner: 10Chad) [03:17:09] 10Operations, 10AutoWikiBrowser, 10Bot-Frameworks, 10Huggle: API Logins are failing to authenticate - https://phabricator.wikimedia.org/T193829#4180991 (10Xaosflux) [03:17:52] (03Merged) 10jenkins-bot: scap patch: Some minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430842 (owner: 10Chad) [03:18:29] (03Merged) 10jenkins-bot: All kinds of py cleanup to swat.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430843 (owner: 10Chad) [03:18:40] (03CR) 10Chad: "How does that import sources thing work anyway? They're not relative to the docroot...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428865 (owner: 10Chad) [03:19:02] (03PS3) 10Chad: Disable LQT on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429463 [03:19:39] (03CR) 10jenkins-bot: scap patch: Some minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430842 (owner: 10Chad) [03:23:20] 10Operations, 10AutoWikiBrowser, 10Bot-Frameworks, 10Huggle: API Logins are failing to authenticate - https://phabricator.wikimedia.org/T193829#4181022 (10Xaosflux) Regenerated botpassword, verified that web logon was working for account - no change [03:25:23] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={create_container,podsandbox_status,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:26:32] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:27:14] 10Operations, 10AutoWikiBrowser, 10Bot-Frameworks, 10Huggle: 
API Logins are failing to authenticate - https://phabricator.wikimedia.org/T193829#4181031 (10Xaosflux) Note: after regeneration of bot password, that account can now logon, second account that was not regenerated is unable to logon still [03:27:37] 10Operations, 10AutoWikiBrowser, 10Bot-Frameworks, 10Huggle: API Logins are failing to authenticate with existing botpassword - https://phabricator.wikimedia.org/T193829#4181033 (10Xaosflux) [03:27:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 842.72 seconds [03:44:54] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4179231 (10Jalexander) >>! In T193769#4180966, @EtaoinWu wrote: > Since the crack started, the CAPTCHA error rate was high. > However, at about 5/3 18:30 UTC, the CAPTCHA error rate su... [03:53:42] (03CR) 10Chad: [C: 032] Disable LQT on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429463 (owner: 10Chad) [03:54:59] (03PS2) 10Chad: multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428865 [03:55:03] (03Merged) 10jenkins-bot: Disable LQT on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429463 (owner: 10Chad) [03:55:16] !log demon@tin Synchronized scap/plugins: No-op plugin style fixes (duration: 01m 11s) [03:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:21] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: disabling LQT on a few closed/unloved testwikis (duration: 01m 11s) [04:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:05] (03CR) 10Chad: [C: 032] multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428865 (owner: 10Chad) [04:14:29] (03PS1) 10Chad: group2 to wmf.2 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/430844 [04:16:08] (03Merged) 10jenkins-bot: multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428865 (owner: 10Chad) [04:17:47] !log demon@tin Synchronized multiversion/getMWVersion: clean up getRealmSpecificFilename() (duration: 01m 07s) [04:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:19:09] (03CR) 10Chad: [C: 032] group2 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430844 (owner: 10Chad) [04:21:05] (03Merged) 10jenkins-bot: group2 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430844 (owner: 10Chad) [04:24:39] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.2 [04:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:02] 10Operations, 10AutoWikiBrowser, 10Bot-Frameworks, 10Huggle: API Logins are failing to authenticate with existing botpassword - https://phabricator.wikimedia.org/T193829#4181065 (10Xaosflux) p:05High>03Normal Some users are reporting success now - please monitor [04:26:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 218.33 seconds [04:53:30] (03CR) 10BryanDavis: "The package builds cleanly using pdebuild on tools-package-builder-01. I'm not quite sure what the test is complaining about." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [05:21:42] !log Deploy schema change on dbstore1002:s4 - T191519 T188299 T190148 [05:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:49] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:21:49] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:21:50] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:25:19] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T193747#4181121 (10Marostegui) 05Open>03Resolved All fixed! ``` root@db1063:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level... [05:28:46] (03PS1) 10Marostegui: s4.hosts: Move db1102 a bit higher in the list [software] - 10https://gerrit.wikimedia.org/r/430845 [05:30:14] (03CR) 10Marostegui: [C: 032] s4.hosts: Move db1102 a bit higher in the list [software] - 10https://gerrit.wikimedia.org/r/430845 (owner: 10Marostegui) [05:30:59] (03Merged) 10jenkins-bot: s4.hosts: Move db1102 a bit higher in the list [software] - 10https://gerrit.wikimedia.org/r/430845 (owner: 10Marostegui) [05:34:22] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4181132 (10Marostegui) 05Open>03Resolved Let's create a general task if this happens again and consider this fixed for now and for this host. 
[05:38:14] (03PS1) 10Marostegui: db-eqiad.php: Clarify db1060 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430846 [05:42:22] 10Operations, 10ops-eqiad, 10DBA: Move db1067 to row C - https://phabricator.wikimedia.org/T193835#4181136 (10Marostegui) [05:43:00] 10Operations, 10ops-eqiad, 10DBA: Move db1067 to row C - https://phabricator.wikimedia.org/T193835#4181148 (10Marostegui) p:05Triage>03Normal [05:44:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clarify db1060 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430846 (owner: 10Marostegui) [05:45:27] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify db1060 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430846 (owner: 10Marostegui) [05:46:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Clarify that db1060 will be decommissioned (duration: 00m 53s) [05:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:39] (03CR) 10Marostegui: [C: 031] mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [06:04:35] (03PS1) 10Marostegui: mariadb: Add db1119 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/430847 (https://phabricator.wikimedia.org/T192979) [06:05:54] (03CR) 10Marostegui: [C: 032] mariadb: Add db1119 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/430847 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [06:07:32] (03PS1) 10Marostegui: s1.hosts: Add db1119 to s1 [software] - 10https://gerrit.wikimedia.org/r/430848 (https://phabricator.wikimedia.org/T192979) [06:10:51] (03CR) 10Marostegui: [C: 032] s1.hosts: Add db1119 to s1 [software] - 10https://gerrit.wikimedia.org/r/430848 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [06:12:07] (03Merged) 10jenkins-bot: s1.hosts: Add db1119 to s1 [software] - 10https://gerrit.wikimedia.org/r/430848 (https://phabricator.wikimedia.org/T192979) 
(owner: 10Marostegui) [06:31:09] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430850 (https://phabricator.wikimedia.org/T192979) [06:31:47] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/update-motd.d/97-last-puppet-run] [06:31:57] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints] [06:33:07] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:33:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430850 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [06:35:57] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430850 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [06:37:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T192979 (duration: 01m 11s) [06:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:56] T192979: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979 [06:42:16] !log Stop MySQL on db1066 to clone db1119 - T192979 [06:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:18] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - 
https://phabricator.wikimedia.org/T193769#4179231 (10Bencemac) [[ https://hu.wikipedia.org/wiki/Wikip%C3%A9dia:Kocsmafal_(m%C5%B1szaki)#Failed_attempt_to_log_in_to_your_account | Spotted ]] by the Hungarian Community too. [06:55:48] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db1119 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430852 (https://phabricator.wikimedia.org/T192979) [06:56:28] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:07] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:17] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:28] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:54] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db1119 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430852 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [07:00:38] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [07:01:44] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1119 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430852 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [07:02:48] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [07:04:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1119 to the config (duration: 01m 06s) [07:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:29] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db1119 to 
the config (duration: 01m 20s) [07:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:20] !log reimaging mw2243, mw2247, mw2248 (job runners) to stretch [07:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:46] (03CR) 10Muehlenhoff: [C: 031] add mwmaint1001 to scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/430521 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [07:12:47] (03CR) 10Muehlenhoff: network: add mwmaint1001 to network constants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [07:16:14] (03CR) 10Muehlenhoff: tcpircbot: add mwmaint1001 to ferm rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [07:30:08] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [07:57:04] 10Operations, 10monitoring: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4181246 (10fgiunchedi) That's the current behavior of the check, i.e. when things are ok exit 0 and no output. We can change it to print "OK" or sth similar, and the values/thresholds per... [08:01:48] PROBLEM - HHVM jobrunner on mw2247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:04:28] PROBLEM - Nginx local proxy to apache on mw2248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:05:08] ^ silencing [08:06:07] 10Operations, 10Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4181262 (10Yann) FYI, there were several attempts yesterday, and again today, both on the English Wikipedia. 
[08:08:36] 10Operations: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4181267 (10fgiunchedi) Thanks for kickstarting this! +1, having syslogs in ELK would be very useful indeed. Some partial answers to the things to figure out: * Capacity - I chatted with @gehel at the last ops friday hangout about... [08:10:58] RECOVERY - HHVM jobrunner on mw2247 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 8.136 second response time [08:11:48] (03CR) 10MarcoAurelio: idwikimedia: initial configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [08:13:28] RECOVERY - Nginx local proxy to apache on mw2248 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 1.145 second response time [08:14:45] (03PS6) 10MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) [08:14:59] (03PS1) 10Marostegui: db-eqiad.php: Slowly pool db1119 in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430856 (https://phabricator.wikimedia.org/T192979) [08:16:42] (03PS1) 10Marostegui: db1119.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/430857 [08:20:15] (03CR) 10Marostegui: [C: 032] db1119.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/430857 (owner: 10Marostegui) [08:21:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly pool db1119 in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430856 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [08:22:44] 10Operations, 10cloud-services-team, 10monitoring: Prometheus vs. CPU usage vs. hyperthreading - https://phabricator.wikimedia.org/T193272#4181274 (10fgiunchedi) In a Prometheus world the cpu utilization is calculated from the number of seconds each cpu has spent in each mode, from the numbers in `/proc/stat... 
[08:22:48] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly pool db1119 in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430856 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [08:24:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool recently new cloned db1119 (duration: 01m 00s) [08:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:38] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4181302 (10Marostegui) >>! In T193325#4177621, @jcrespo wrote: > The check detected some difference, but they could be false positives, checking again.... [08:29:41] !log mobrovac@tin Started deploy [cpjobqueue/deploy@193cf6f]: Config: Exclude refreshLinks from the RegEx rule [08:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:59] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4181303 (10Marostegui) p:05High>03Normal [08:30:27] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@193cf6f]: Config: Exclude refreshLinks from the RegEx rule (duration: 00m 47s) [08:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:26] (03PS1) 10Marostegui: db-eqiad.php: Slowly pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430859 [08:36:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430859 (owner: 10Marostegui) [08:38:52] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430859 (owner: 10Marostegui) [08:40:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1119 in s1 API (duration: 00m 59s) [08:40:09] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:39] !log reimaging mw2249, mw2250, mw2253 (job runners) to stretch [08:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:45] (03PS7) 10MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) [08:49:43] (03CR) 10ArielGlenn: "I think you can just skip the inclusion of the packages::php5 and packages::php7 classes completely. They are picked up by mediawiki::php " [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [08:50:30] (03PS8) 10MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) [08:51:45] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430860 [08:53:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430860 (owner: 10Marostegui) [08:55:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430860 (owner: 10Marostegui) [08:57:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1119 (duration: 01m 07s) [08:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:17] (03PS1) 10Marostegui: db-eqiad.php: Give more API traffic to db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430861 [09:05:22] (03CR) 10Jcrespo: [C: 031] "This is ok, but needs manual deploy to the master on merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [09:05:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give more API traffic to db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430861 (owner: 10Marostegui) [09:07:10] (03Merged) 10jenkins-bot: db-eqiad.php: Give more API traffic to db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430861 (owner: 10Marostegui) [09:08:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase API traffic for db1119 (duration: 00m 59s) [09:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:24] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4181389 (10jcrespo) Yes, no differences were assured on a second run. I will repool the server now. [09:14:27] (03PS1) 10Marostegui: db-eqiad.php: Fully pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430863 [09:16:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430863 (owner: 10Marostegui) [09:17:27] (03Merged) 10jenkins-bot: db-eqiad.php: Fully pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430863 (owner: 10Marostegui) [09:18:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully pool db1119 in s1 API (duration: 00m 59s) [09:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:21] (03PS2) 10Jcrespo: mariadb: Remove mediawiki references to db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430599 (https://phabricator.wikimedia.org/T193736) [09:20:20] (03CR) 10jenkins-bot: All kinds of py cleanup to swat.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430843 (owner: 10Chad) [09:20:25] (03CR) 10jenkins-bot: Disable LQT on several wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/429463 (owner: 10Chad) [09:20:30] (03CR) 10jenkins-bot: multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428865 (owner: 10Chad) [09:20:38] (03CR) 10jenkins-bot: group2 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430844 (owner: 10Chad) [09:20:41] (03CR) 10jenkins-bot: db-eqiad.php: Clarify db1060 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430846 (owner: 10Marostegui) [09:20:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430850 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [09:20:53] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1119 to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430852 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [09:20:57] (03CR) 10jenkins-bot: db-eqiad.php: Slowly pool db1119 in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430856 (https://phabricator.wikimedia.org/T192979) (owner: 10Marostegui) [09:21:03] (03CR) 10jenkins-bot: db-eqiad.php: Slowly pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430859 (owner: 10Marostegui) [09:21:08] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430860 (owner: 10Marostegui) [09:21:15] (03CR) 10jenkins-bot: db-eqiad.php: Give more API traffic to db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430861 (owner: 10Marostegui) [09:21:23] (03CR) 10jenkins-bot: db-eqiad.php: Fully pool db1119 in s1 API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430863 (owner: 10Marostegui) [09:21:32] (03CR) 10Mobrovac: "Should we get this out finally?" 
[puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) (owner: 10Eevans) [09:23:50] (03CR) 10Jcrespo: [C: 032] mariadb: Remove mediawiki references to db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430599 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [09:26:17] (03Merged) 10jenkins-bot: mariadb: Remove mediawiki references to db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430599 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [09:30:29] PROBLEM - High CPU load on API appserver on mw2253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:31:46] ^ reimage, silencing [09:33:19] PROBLEM - HHVM jobrunner on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:28] (03PS8) 10Rduran: [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [09:35:30] (03PS1) 10Rduran: [WIP] Use Cumin to implement the comunication for the transfer [puppet] - 10https://gerrit.wikimedia.org/r/430868 [09:39:11] !log uploaded openjdk-8 8u171-b11 for jessie-wikimedia to apt.wikimedia.org [09:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:07] 10Operations, 10ops-eqiad, 10DBA: Move db1066 to row A - https://phabricator.wikimedia.org/T193847#4181453 (10Marostegui) [09:40:27] 10Operations, 10ops-eqiad, 10DBA: Move db1066 to row A - https://phabricator.wikimedia.org/T193847#4181465 (10Marostegui) p:05Triage>03Normal [09:40:58] PROBLEM - Nginx local proxy to apache on mw2249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:15] (03PS1) 10Ema: varnishlogconsumer: restart services on file change [puppet] - 10https://gerrit.wikimedia.org/r/430869 [09:43:17] (03PS1) 10Ema: varnishlogconsumer: do not install python-logstash [puppet] - 10https://gerrit.wikimedia.org/r/430870 [09:43:34] 10Operations, 10Security-Team: Thousands of failed login attempts 
(wrong password) - https://phabricator.wikimedia.org/T193769#4181468 (10Elitre) [09:45:38] RECOVERY - HHVM jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.101 second response time [09:47:09] RECOVERY - Nginx local proxy to apache on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.154 second response time [09:52:40] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1056 (duration: 01m 00s) [09:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:20] (03CR) 10Vgutierrez: [C: 031] "We introduced a slightly change on how we manage python3-logstash installation with the varnishlog refactor. Now it's gets installed on fi" [puppet] - 10https://gerrit.wikimedia.org/r/430870 (owner: 10Ema) [09:56:49] (03CR) 10Vgutierrez: [C: 031] varnishlogconsumer: restart services on file change [puppet] - 10https://gerrit.wikimedia.org/r/430869 (owner: 10Ema) [09:57:57] (03CR) 10Ema: [C: 032] varnishlogconsumer: restart services on file change [puppet] - 10https://gerrit.wikimedia.org/r/430869 (owner: 10Ema) [09:58:07] (03CR) 10Ema: [C: 032] varnishlogconsumer: do not install python-logstash [puppet] - 10https://gerrit.wikimedia.org/r/430870 (owner: 10Ema) [09:59:51] (03PS2) 10Rduran: [WIP] Use Cumin to implement the comunication for the transfer [puppet] - 10https://gerrit.wikimedia.org/r/430868 [10:03:39] (03CR) 10jenkins-bot: mariadb: Remove mediawiki references to db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430599 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [10:04:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.094 second response time [10:07:21] (03CR) 10Muehlenhoff: Allow removing Diamond gradually (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [10:07:23] (03PS3) 
10Muehlenhoff: Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) [10:09:34] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.087 second response time [10:13:54] RECOVERY - High CPU load on API appserver on mw2253 is OK: OK - load average: 9.51, 10.69, 6.14 [10:21:54] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.126 second response time [10:24:32] (03CR) 10Mobrovac: [C: 031] "Applied on scb2001 manually, all good." [puppet] - 10https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: 10Mobrovac) [10:26:54] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.114 second response time [10:32:59] (03CR) 10Ema: "> Do we have prometheus monitoring to replace these?" 
[puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [10:36:03] (03PS1) 10Marostegui: dbstore.my.cnf: Enable innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/430875 (https://phabricator.wikimedia.org/T150949) [10:38:27] (03PS2) 10Marostegui: dbstore_multiinstance.my.cnf: Enable innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/430875 (https://phabricator.wikimedia.org/T150949) [10:39:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.107 second response time [10:40:13] (03PS1) 10Arturo Borrero Gonzalez: d/service: fix ExecStart call [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430876 (https://phabricator.wikimedia.org/T188392) [10:41:34] (03CR) 10Muehlenhoff: [C: 031] d/service: fix ExecStart call [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430876 (https://phabricator.wikimedia.org/T188392) (owner: 10Arturo Borrero Gonzalez) [10:42:26] 10Operations, 10Cloud-VPS, 10Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392#4181629 (10aborrero) >>! In T188392#4175590, @chasemp wrote: > fyi on labtestneutron2001 atm > > ```root@labtestneutron2001:~# systemctl list-units --state=failed... 
[10:43:07] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/service: fix ExecStart call [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430876 (https://phabricator.wikimedia.org/T188392) (owner: 10Arturo Borrero Gonzalez) [10:46:02] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.3 jessie-wikimedia [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430878 [10:46:49] (03CR) 10Jcrespo: [C: 031] dbstore_multiinstance.my.cnf: Enable innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/430875 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [10:47:51] (03CR) 10Marostegui: [C: 032] dbstore_multiinstance.my.cnf: Enable innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/430875 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [10:51:58] !log reimaging mw2153, mw2154, mw2155 (job runners) to stretch [10:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:07] (03PS1) 10Aklapper: Fix wrong link to Server Admin Log on noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430879 (https://phabricator.wikimedia.org/T193848) [10:54:34] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.115 second response time [10:55:48] !log Manually enable innodb_strict_mode just on dbstore2001:3315 - T150949 [10:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:53] T150949: Set barracuda InnoDB file format as the default configuration everywhere - https://phabricator.wikimedia.org/T150949 [11:10:22] (03PS2) 10R4q3NWnUx2CEhVyr: Allocate only the needed size for the format structure array [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 [11:13:38] (03PS1) 10Volans: debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 
(https://phabricator.wikimedia.org/T191299) [11:14:14] (03CR) 10jerkins-bot: [V: 04-1] debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [11:15:35] (03PS1) 10Arturo Borrero Gonzalez: d/: drop upstart file [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430882 (https://phabricator.wikimedia.org/T188392) [11:16:20] (03PS2) 10Volans: debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) [11:17:00] (03CR) 10jerkins-bot: [V: 04-1] debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [11:18:28] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/: drop upstart file [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430882 (https://phabricator.wikimedia.org/T188392) (owner: 10Arturo Borrero Gonzalez) [11:18:47] (03PS3) 10Volans: debmonitor: add server side puppettization [puppet] - 10https://gerrit.wikimedia.org/r/430881 (https://phabricator.wikimedia.org/T191299) [11:20:55] (03PS2) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.3 jessie-wikimedia [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430878 [11:21:15] (03PS1) 10Jcrespo: mariadb: Decommission db1056 [puppet] - 10https://gerrit.wikimedia.org/r/430884 (https://phabricator.wikimedia.org/T193736) [11:22:28] (03PS1) 10Jcrespo: dbhosts: Remove db1056 for decom [software] - 10https://gerrit.wikimedia.org/r/430885 (https://phabricator.wikimedia.org/T193736) [11:22:58] !log stopping db1056 and moving it to spare [11:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:57] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/changelog: generate entry for 0.3 jessie-wikimedia [debs/prometheus-rabbitmq-exporter] - 
10https://gerrit.wikimedia.org/r/430878 (owner: 10Arturo Borrero Gonzalez) [11:26:42] (03CR) 10Jcrespo: [C: 032] dbhosts: Remove db1056 for decom [software] - 10https://gerrit.wikimedia.org/r/430885 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [11:26:56] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1056 [puppet] - 10https://gerrit.wikimedia.org/r/430884 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [11:36:11] (03CR) 10Volans: "@hashar: did you had a chance to test this?" [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans) [11:41:08] (03PS1) 10Muehlenhoff: Switch video scalers to a profile [puppet] - 10https://gerrit.wikimedia.org/r/430892 [11:42:04] 10Operations, 10ops-eqiad: Broken memory/CPU on mw1275 - https://phabricator.wikimedia.org/T192902#4181812 (10Cmjohnson) @MoritzMuehlenhoff The error has not returned, go ahead and re-install. The error was correctable, so moving and reseating may have fixed the issue. [11:44:28] 10Operations, 10ops-eqiad: Broken memory/CPU on mw1275 - https://phabricator.wikimedia.org/T192902#4181813 (10MoritzMuehlenhoff) Thanks, I'll reimage it on Monday. [11:45:06] 10Operations, 10Cloud-VPS, 10Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392#4181820 (10aborrero) New version `0.3` was added to `jessie-wikimedia`. [11:50:37] (03CR) 10Gehel: [C: 031] "I'll merge this next Monday. 
Not too keen on pushing it out on a Friday :)" [puppet] - 10https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: 10Mobrovac) [11:55:03] (03PS1) 10Gehel: maps: re-enable OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/430893 (https://phabricator.wikimedia.org/T191655) [11:56:05] (03PS1) 10Jcrespo: mariadb: Repool db2081 after crash & check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430894 (https://phabricator.wikimedia.org/T193325) [11:59:43] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2081 after crash & check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430894 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo) [12:01:15] (03Merged) 10jenkins-bot: mariadb: Repool db2081 after crash & check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430894 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo) [12:01:23] (03CR) 10Gehel: "Puppet compiler looks happy" [puppet] - 10https://gerrit.wikimedia.org/r/430893 (https://phabricator.wikimedia.org/T191655) (owner: 10Gehel) [12:01:32] (03CR) 10jenkins-bot: mariadb: Repool db2081 after crash & check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430894 (https://phabricator.wikimedia.org/T193325) (owner: 10Jcrespo) [12:01:35] (03CR) 10Gehel: [C: 032] maps: re-enable OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/430893 (https://phabricator.wikimedia.org/T191655) (owner: 10Gehel) [12:02:03] PROBLEM - Nginx local proxy to apache on mw2154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:33] PROBLEM - mediawiki-installation DSH group on mw2155 is CRITICAL: Host mw2155 is not in mediawiki-installation dsh group [12:04:41] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2081 (duration: 00m 59s) [12:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:54] PROBLEM - Nginx local proxy to apache on mw2153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[12:07:34] PROBLEM - HHVM jobrunner on mw2155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:07:53] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: rabbitmq_exporter package was added to jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/430411 (https://phabricator.wikimedia.org/T188392) (owner: 10Arturo Borrero Gonzalez) [12:08:00] (03PS2) 10Arturo Borrero Gonzalez: openstack: rabbitmq_exporter package was added to jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/430411 (https://phabricator.wikimedia.org/T188392) [12:08:24] PROBLEM - mediawiki-installation DSH group on mw2154 is CRITICAL: Host mw2154 is not in mediawiki-installation dsh group [12:09:30] !log restarted Jenkins on contint1001 (java update) [12:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:04] 10Operations, 10Cloud-VPS, 10Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392#4181873 (10aborrero) 05Open>03Resolved [12:11:53] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [12:12:34] PROBLEM - HHVM jobrunner on mw2154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:33] PROBLEM - mediawiki-installation DSH group on mw2153 is CRITICAL: Host mw2153 is not in mediawiki-installation dsh group [12:13:53] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [12:15:14] PROBLEM - Nginx local proxy to apache on mw2155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:34] PROBLEM - HHVM jobrunner on mw2153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:21:49] ^ silencing [12:28:25] (03Abandoned) 10Jcrespo: mariadb: Backup new m5 databases striker and labsdbaccounts [puppet] - 10https://gerrit.wikimedia.org/r/328476 (owner: 10Jcrespo) [12:36:24] 10Operations, 10Traffic: Enable numa_networking on 
all caches - https://phabricator.wikimedia.org/T193865#4181953 (10ema) [12:36:33] 10Operations, 10Traffic: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865#4181964 (10ema) p:05Triage>03Normal [12:36:43] PROBLEM - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:38:44] (03PS1) 10Ema: numa_networking: enable on cache_canary and cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/430896 (https://phabricator.wikimedia.org/T193865) [12:39:13] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/prometheus/rabbitmq-exporter.yaml] [12:42:37] (03PS2) 10Ema: numa_networking: enable on cache_canary and cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/430896 (https://phabricator.wikimedia.org/T193865) [12:46:44] (03CR) 10Muehlenhoff: [C: 031] "Ah, indeed. We only need to make the package name for readline dependant on the distro." 
[puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [12:49:08] (03PS1) 10Filippo Giunchedi: prometheus_check_metric: print message when status is OK [puppet] - 10https://gerrit.wikimedia.org/r/430898 (https://phabricator.wikimedia.org/T193793) [12:49:14] (03CR) 10Ema: [C: 032] numa_networking: enable on cache_canary and cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/430896 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [12:53:55] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [13:11:05] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [13:15:25] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [13:19:58] (03PS1) 10Marostegui: mariadb: Enable innodb_strict_mode on labs and misc [puppet] - 10https://gerrit.wikimedia.org/r/430901 (https://phabricator.wikimedia.org/T150949) [13:21:18] 10Operations, 10AutoWikiBrowser, 10Bot-Frameworks, 10Huggle: API Logins are failing to authenticate with existing botpassword - https://phabricator.wikimedia.org/T193829#4180967 (10Anomie) If you change the main password on your account, that also invalidates all bot passwords. 
[13:30:12] (03CR) 10Jcrespo: [C: 031] mariadb: Enable innodb_strict_mode on labs and misc [puppet] - 10https://gerrit.wikimedia.org/r/430901 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [13:30:38] (03CR) 10Marostegui: [C: 032] mariadb: Enable innodb_strict_mode on labs and misc [puppet] - 10https://gerrit.wikimedia.org/r/430901 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [13:33:14] !log Manually enable innodb_strict_mode on labsdb1009 - T150949 [13:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:18] T150949: Set barracuda InnoDB file format as the default configuration everywhere - https://phabricator.wikimedia.org/T150949 [13:33:23] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.083 second response time [13:34:12] (03PS1) 10Ema: numa_networking: move hiera call to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 [13:34:48] (03CR) 10jerkins-bot: [V: 04-1] numa_networking: move hiera call to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (owner: 10Ema) [13:35:54] 10Operations, 10monitoring, 10Patch-For-Review: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4182138 (10Dzahn) The screenshot above is from a time when a host being reinstalled and every other check on the host was red. Is it possible that it was actually ok... 
[13:38:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1956 bytes in 0.107 second response time [13:38:33] RECOVERY - HHVM jobrunner on mw2153 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.085 second response time [13:39:47] (03PS1) 10Subramanya Sastry: Enable RemexHtml on more wikibooks wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430903 (https://phabricator.wikimedia.org/T192821) [13:41:23] RECOVERY - HHVM jobrunner on mw2155 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.085 second response time [13:43:44] RECOVERY - Nginx local proxy to apache on mw2153 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.235 second response time [13:45:59] RECOVERY - Nginx local proxy to apache on mw2155 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.154 second response time [13:47:29] RECOVERY - HHVM jobrunner on mw2154 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.075 second response time [13:48:09] (03CR) 10Andrew Bogott: [C: 031] Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [13:48:17] 10Operations, 10monitoring, 10Patch-For-Review: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4182214 (10fgiunchedi) >>! In T193793#4182138, @Dzahn wrote: > The screenshot above is from a time when a host being reinstalled and every other check on the host wa... 
[13:48:20] RECOVERY - Nginx local proxy to apache on mw2154 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.154 second response time [13:49:29] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4182219 (10jcrespo) 05Open>03Resolved [14:03:25] (03PS1) 10Vgutierrez: varnishtlsinspector: Stop collecting TLS data [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) [14:05:21] !log mw2189, mw2190, mw2192 - reinstall with stretch, mw2191 - puppet cert not found [14:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:56] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler02/11111/" [puppet] - 10https://gerrit.wikimedia.org/r/430892 (owner: 10Muehlenhoff) [14:12:54] (03PS1) 10Muehlenhoff: Remove support for trusty in mediawiki classes [puppet] - 10https://gerrit.wikimedia.org/r/430912 [14:14:35] (03CR) 10Ema: [C: 04-1] varnishtlsinspector: Stop collecting TLS data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:15:11] 10Operations, 10Developer-Relations, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#4182281 (10Tgr) [14:15:55] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4182284 (10Marostegui) 05Open>03stalled [14:16:33] (03PS2) 10Vgutierrez: varnishtlsinspector: Stop collecting TLS data [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) [14:16:45] (03CR) 10Vgutierrez: varnishtlsinspector: Stop collecting TLS data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) 
[14:23:36] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182294 (10Cmjohnson) @Vgutierrez I did what bblack suggested and switched the cables to the opposite card. Let's see if the magic works [14:25:20] (03PS1) 10Ema: Revert "numa_networking: enable on cache_canary and cache_misc" [puppet] - 10https://gerrit.wikimedia.org/r/430915 [14:25:42] (03PS2) 10Ema: Revert "numa_networking: enable on cache_canary and cache_misc" [puppet] - 10https://gerrit.wikimedia.org/r/430915 [14:25:50] (03CR) 10ArielGlenn: "I think this is ok; didn't we have a bit of an issue recently with php5 in use on jessie (maybe terbium)?" [puppet] - 10https://gerrit.wikimedia.org/r/430912 (owner: 10Muehlenhoff) [14:26:18] (03CR) 10Ema: [C: 032] Revert "numa_networking: enable on cache_canary and cache_misc" [puppet] - 10https://gerrit.wikimedia.org/r/430915 (owner: 10Ema) [14:27:46] (03CR) 10Muehlenhoff: "Yeah, but terbium and contint explicitly pull in mediawiki::packages::php5 (that's also why I didn't remove that class yet)" [puppet] - 10https://gerrit.wikimedia.org/r/430912 (owner: 10Muehlenhoff) [14:30:33] (03CR) 10ArielGlenn: [C: 031] "With that caveat, lgtm." 
[puppet] - 10https://gerrit.wikimedia.org/r/430912 (owner: 10Muehlenhoff) [14:32:41] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182305 (10Vgutierrez) Awesome, I just confirmed the new interface naming for lvs1016: * eth0 -> enp4s0f0 * eth1 -> enp4s0f1 * eth2 -> enp5s0f0 * eth3 -> enp5s0f1 [14:34:24] (03PS1) 10Muehlenhoff: Move scap proxy in C4 to mw2188 [puppet] - 10https://gerrit.wikimedia.org/r/430918 [14:34:58] (03PS1) 10Jcrespo: mariadb: Reenable notifications on several core hosts [puppet] - 10https://gerrit.wikimedia.org/r/430919 (https://phabricator.wikimedia.org/T192979) [14:36:48] (03PS1) 10Arturo Borrero Gonzalez: admin: add new sudo alias in my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/430920 [14:37:00] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] admin: add new sudo alias in my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/430920 (owner: 10Arturo Borrero Gonzalez) [14:37:38] (03PS1) 10Addshore: Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430921 [14:39:20] 10Operations, 10ops-eqiad: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T193757#4182321 (10Cmjohnson) 05Open>03declined There is already a task for this and the status is rebuild...declining this task [14:40:06] (03PS2) 10Addshore: Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430921 [14:41:13] (03CR) 10Eevans: [C: 04-1] "I'd prefer we continue to postpone this, possibly until after an upgrade to 3.11.2 (https://phabricator.wikimedia.org/T178905). 
With ever" [puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) (owner: 10Eevans) [14:44:11] (03PS1) 10Addshore: Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/430923 [14:44:12] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 60476 MB (12% inode=99%) [14:44:41] (03CR) 10jerkins-bot: [V: 04-1] Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/430923 (owner: 10Addshore) [14:45:40] (03PS2) 10Addshore: Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/430923 [14:46:51] (03CR) 10Ema: [C: 031] varnishtlsinspector: Stop collecting TLS data [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:49:32] RECOVERY - Disk space on elastic1027 is OK: DISK OK [14:50:01] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185#4182332 (10Cmjohnson) [14:50:03] 10Operations, 10ops-eqiad: change hostname label for mw1297 to mwmaint1001 - https://phabricator.wikimedia.org/T193798#4182331 (10Cmjohnson) 05Open>03Resolved [14:52:30] !log disabling puppet in labvirt10[01-22] to deploy https://gerrit.wikimedia.org/r/#/c/430581/ and https://gerrit.wikimedia.org/r/#/c/430614/ T193657 [14:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:35] T193657: integrate nova.conf missing settings into neutron setup - https://phabricator.wikimedia.org/T193657 [14:55:37] (03PS5) 10Jcrespo: mediawiki: Add clearTermSqlIndexSearchFields for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779) (owner: 10Ladsgroup) [14:55:57] (03PS5) 10Addshore: Wikidata dispatch, Use a LockManager with 
short TTL for testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) [14:55:59] (03PS1) 10Addshore: Wikidata dispatch, disable dispatching for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430924 [14:56:01] (03PS1) 10Addshore: Revert "Wikidata dispatch, disable dispatching for testwikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430925 [14:58:01] (03PS2) 10Addshore: Wikidata dispatch, disable dispatching for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430924 [14:58:06] (03PS6) 10Addshore: Wikidata dispatch, Use a LockManager with short TTL for testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) [14:58:17] (03PS2) 10Addshore: Revert "Wikidata dispatch, disable dispatching for testwikidatawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430925 [15:00:56] (03CR) 10Jcrespo: [C: 032] mediawiki: Add clearTermSqlIndexSearchFields for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779) (owner: 10Ladsgroup) [15:02:52] PROBLEM - Nginx local proxy to apache on mw2190 is CRITICAL: connect to address 10.192.32.78 and port 443: Connection refused [15:02:52] PROBLEM - HHVM rendering on mw2192 is CRITICAL: connect to address 10.192.32.80 and port 80: Connection refused [15:02:52] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2190 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:02:52] PROBLEM - nutcracker process on mw2192 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:33] RECOVERY - mediawiki-installation DSH group on mw2155 is OK: OK [15:04:32] PROBLEM - Check whether ferm is active by checking the default input chain on mw2190 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:04:39] got it ^ [15:05:05] !log disabling puppet in labcontrol1001 for T193657 [15:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:09] T193657: integrate nova.conf missing settings into neutron setup - https://phabricator.wikimedia.org/T193657 [15:06:37] !log disabling puppet in labnodepool100[1-2] for T193657 [15:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:50] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: nova.conf: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430581 (https://phabricator.wikimedia.org/T193657) (owner: 10Arturo Borrero Gonzalez) [15:06:57] (03PS3) 10Arturo Borrero Gonzalez: openstack: nova.conf: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430581 (https://phabricator.wikimedia.org/T193657) [15:07:01] (03PS2) 10Arturo Borrero Gonzalez: openstack: api-paste.ini: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430614 (https://phabricator.wikimedia.org/T193657) [15:07:03] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] openstack: nova.conf: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430581 (https://phabricator.wikimedia.org/T193657) (owner: 10Arturo Borrero Gonzalez) [15:07:46] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: api-paste.ini: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430614 (https://phabricator.wikimedia.org/T193657) (owner: 10Arturo Borrero Gonzalez) [15:07:53] (03PS3) 10Arturo Borrero Gonzalez: openstack: api-paste.ini: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430614 (https://phabricator.wikimedia.org/T193657) [15:08:32] RECOVERY - mediawiki-installation DSH group on mw2154 is OK: OK [15:13:42] RECOVERY - mediawiki-installation DSH group on mw2153 is OK: OK [15:16:43] !log mw2194, mw2195, mw2196 - reinstall with stretch - mw2193 - puppet cert not found [15:16:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:46] (03PS2) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [15:17:19] (03CR) 10jerkins-bot: [V: 04-1] numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [15:17:51] (03CR) 10Muehlenhoff: [C: 04-1] "The Prometheus dashboard for Kafka only shows "no data points" for "nf_conntrack count" (while the Graphite version has the data), but hav" [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:20:29] (03PS3) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [15:25:25] (03CR) 10Hashar: "Almost:" [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans) [15:26:26] (03PS4) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) [15:30:03] (03PS1) 10Vgutierrez: lvs: lvs1016.eqiad.wmnet configuration [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) [15:30:07] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:31:06] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:31:23] (03CR) 10Ema: "noop on cp1008 (flag toggled but only one numa domain), looks good on cp2006: https://puppet-compiler.wmflabs.org/compiler02/11119/" [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [15:37:14] (03PS2) 10Vgutierrez: lvs: lvs1016.eqiad.wmnet configuration [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) [15:49:57] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4182442 (10Deskana) a:05Deskana>03None Okay, I think we've got what we need now! Here's the 20 wikis they need: * Wikipedias: ** English ** Spanish ** German ** Russian ** Ja... [15:50:55] (03CR) 10Imarlier: "> Puppet is failing on graphite machines trying to remove coal user" [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [15:57:24] !log enabled puppet in labcontrol1001, labnodepol100[1-2] and labtestvirt10[01-22] after patches deployed for T193657 [15:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:28] T193657: integrate nova.conf missing settings into neutron setup - https://phabricator.wikimedia.org/T193657 [15:58:08] (03PS1) 10Arturo Borrero Gonzalez: openstack: delete unused neutron nova.conf.erb template syntax [puppet] - 10https://gerrit.wikimedia.org/r/430931 [15:58:31] (03CR) 10BBlack: [C: 04-1] lvs: lvs1016.eqiad.wmnet configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [15:58:52] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182492 (10ayounsi) >>! 
In T184293#4182305, @Vgutierrez wrote: > Awesome, I just confirmed the new interface naming for lvs1016: > * eth0 -> enp4s0f0 > * eth1 -> enp4s0f1 > *... [15:59:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labstore1003 SMART failure - https://phabricator.wikimedia.org/T193651#4175173 (10Bstorm) Looks like there are no alerts on this now, in checking the raid: ``` megacli -PDList -aALL | grep 'S.M.A.' Drive has flagged a S.M.A.R.T alert : No Driv... [16:00:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labstore1003 SMART failure - https://phabricator.wikimedia.org/T193651#4182495 (10Bstorm) 05Open>03Resolved [16:05:22] Heads up, I'm going to do a patch deploy to MW related to logging stuff for the ongoing password bruteforcing attempts [16:06:36] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: delete unused neutron nova.conf.erb template syntax [puppet] - 10https://gerrit.wikimedia.org/r/430931 (owner: 10Arturo Borrero Gonzalez) [16:10:29] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:11:19] RECOVERY - puppet last run on labtestmetal2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:23] (03PS3) 10Vgutierrez: lvs: lvs1016.eqiad.wmnet configuration [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) [16:13:32] (03CR) 10Vgutierrez: lvs: lvs1016.eqiad.wmnet configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [16:15:58] (03PS1) 10Andrew Bogott: vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 [16:17:06] (03CR) 10BBlack: [C: 031] "looks good to human eyes!" 
[puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [16:19:06] !log Logging adjustment in mediawiki for T193762 [16:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:19] (03CR) 10Andrew Bogott: [C: 04-1] "Argh, this doesn't work because of https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864074" [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [16:25:36] (03CR) 10Krinkle: Add .gitreview file (031 comment) [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/430306 (owner: 10Gilles) [16:25:49] !log adding BGP graceful shutdown to routers - T190323 [16:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:54] T190323: Implement BGP graceful shutdown - https://phabricator.wikimedia.org/T190323 [16:30:18] (03PS2) 10Andrew Bogott: vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 [16:31:23] (03PS3) 10Andrew Bogott: vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 [16:36:03] (03CR) 10Paladox: [C: 031] vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [16:40:11] (03CR) 10Dzahn: "thanks! i'm just not sure.. does this mean you are going to do it later or that you expect me to do it?" [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [16:41:04] (03CR) 10Marostegui: [C: 031] vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [16:43:12] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4182635 (10RobH) @MoritzMuehlenhoff: Is the above all that is needed to grant this access? 
I've not dealt with the google search console before, so I'm not sure how we should re... [16:43:19] (03CR) 10Awight: [C: 031] "Yes, please!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [16:43:49] (03CR) 10Dzahn: "it's because i commented the section in site.pp. checking that now" [puppet] - 10https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [16:44:13] (03CR) 10Herron: [C: 031] vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [16:47:52] (03PS1) 10Dzahn: mwmaint: add mapped IPv6 address on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430939 (https://phabricator.wikimedia.org/T192092) [16:48:21] (03CR) 10Jcrespo: [C: 031] "+1 but being a system-wide change, I would seek large consensus (even if users can override it if they don't like it) and announce it on o" [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [16:49:38] (03CR) 10Vgutierrez: [C: 032] lvs: lvs1016.eqiad.wmnet configuration [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [16:49:44] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2196.codfw.wmnet [16:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:47] (03PS4) 10Vgutierrez: lvs: lvs1016.eqiad.wmnet configuration [puppet] - 10https://gerrit.wikimedia.org/r/430927 (https://phabricator.wikimedia.org/T184293) [16:50:12] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2194.codfw.wmnet [16:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:23] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2195.codfw.wmnet [16:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:29] !log dzahn@neodymium 
conftool action : set/pooled=inactive; selector: name=mw2194.codfw.wmnet [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:52] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2196.codfw.wmnet [16:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:26] (03PS2) 10Dzahn: mwmaint: add mapped IPv6 address on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430939 (https://phabricator.wikimedia.org/T192092) [16:52:52] * bawolff adjusting some login throttling code [16:53:23] (03CR) 10Dzahn: [C: 032] mwmaint: add mapped IPv6 address on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430939 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [16:53:27] (03PS3) 10Dzahn: mwmaint: add mapped IPv6 address on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430939 (https://phabricator.wikimedia.org/T192092) [16:57:10] !log adjusted login throttling code (T193762) [16:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:30] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4182673 (10Jgreen) p:05Triage>03Unbreak! [17:05:44] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [17:05:59] ^^ that's me [17:07:02] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4182693 (10RobH) Please note I'm attempting to triage and process this as part of my duties this week as SRE clinic duty. As such, I may make a mistake below, and apologize in a... 
[17:07:30] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=40) [17:09:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 76 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [17:12:09] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4182701 (10Jgreen) a:03Papaul [17:12:36] (03PS1) 10Ayounsi: Depool eqsin because of router issue [dns] - 10https://gerrit.wikimedia.org/r/430940 [17:13:20] (03PS2) 10Ottomata: Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) [17:14:26] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [17:14:45] (03CR) 10Ayounsi: [C: 032] Depool eqsin because of router issue [dns] - 10https://gerrit.wikimedia.org/r/430940 (owner: 10Ayounsi) [17:15:16] (03PS1) 10Vgutierrez: pybal: Set lvs1016 bgp peer address [puppet] - 10https://gerrit.wikimedia.org/r/430941 (https://phabricator.wikimedia.org/T184293) [17:16:06] !log depolled eqsin [17:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:21] (03CR) 10BryanDavis: [C: 031] "Needs SWAT or to be added to the train by RelEng." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/430879 (https://phabricator.wikimedia.org/T193848) (owner: 10Aklapper) [17:18:30] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430942 (https://phabricator.wikimedia.org/T174047) [17:18:32] (03PS4) 10Dzahn: mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [17:20:15] (03CR) 10Vgutierrez: [C: 032] pybal: Set lvs1016 bgp peer address [puppet] - 10https://gerrit.wikimedia.org/r/430941 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [17:20:21] (03PS2) 10Vgutierrez: pybal: Set lvs1016 bgp peer address [puppet] - 10https://gerrit.wikimedia.org/r/430941 (https://phabricator.wikimedia.org/T184293) [17:26:26] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4182742 (10RobH) I've emailed out to the SRE team in an attempt to hammer down the details/process for these requests. 
[17:30:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [17:32:37] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 40 connections established with conf1001.eqiad.wmnet:2379 (min=40) [17:35:59] (03CR) 10Dzahn: "looks like that is not the case though, when removing the include line for php5 packages i do get a diff in compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [17:36:19] (03CR) 10Dzahn: [C: 04-1] mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [17:37:47] 10Operations, 10Traffic: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4182796 (10ayounsi) [17:38:39] robh: since you're on clinic duty, are you the right person to ask about creating mailing lists? https://phabricator.wikimedia.org/T192865 [17:39:11] hrmm [17:39:18] i dont think we import from third party lists [17:39:21] at least never have in past. [17:39:24] (that im aware) [17:40:24] as a policy reason or is there something technical preventing it? [17:41:07] (03CR) 10Vgutierrez: [C: 032] varnishtlsinspector: Stop collecting TLS data [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [17:41:11] also importing the archives was an extra thing, so if that's not possible it's totally fine [17:41:14] (03PS3) 10Vgutierrez: varnishtlsinspector: Stop collecting TLS data [puppet] - 10https://gerrit.wikimedia.org/r/430911 (https://phabricator.wikimedia.org/T193376) [17:41:33] Oh, I have no idea of the technical limitations, it just importing a list from another company/service to our own seems odd. [17:41:41] im not sure if it has any kind of legal implications. [17:41:58] the other part (making a list for use on our server) is easy. 
[17:42:16] so i can do that, and then the import part can wait for later, though if we import AFTER the list has an index [17:42:20] it will reindex the mbox [17:42:25] which can be annoying for linking. [17:42:37] as long as thats not an issue, i think we can always import later. [17:42:47] (in particualr if its a known thing from the get go) [17:43:01] (03PS5) 10Dzahn: mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [17:43:08] I think that would be perfect [17:43:24] cool, then yeah ill make the list for ya no problem [17:43:32] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [17:43:50] thanks :) [17:44:07] legoktm: its @member right? [17:44:09] you have dot member [17:44:25] oops, yes [17:44:37] so you'll get two password emails [17:44:43] (fixed on the task) [17:44:43] ignore the first since it'll invalidate when i make the second [17:44:48] the second will go to both you and mortiz [17:44:51] * legoktm nods [17:44:53] first only goes to you [17:45:35] huh, mailman password must have changed on me. [17:47:24] nope, password changed in my manager by accident [17:47:30] must have had it selected and hotkeyed it... [17:47:43] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182828 (10Vgutierrez) ``` root@lvs1016:~# ethtool -l enp4s0f0 Channel parameters for enp4s0f0: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 15 Current hardware settin... 
[17:48:19] (03PS6) 10Dzahn: mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [17:48:50] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [17:48:55] arrrrr [17:49:07] ACKNOWLEDGEMENT - puppet last run on lvs1016 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[ethtool_rss_combined_channels_enp4s0f0],Exec[ethtool_rss_combined_channels_enp4s0f1] Vgutierrez https://phabricator.wikimedia.org/T184293#4182828 [17:51:05] 10Operations, 10netops: Implement BGP graceful shutdown - https://phabricator.wikimedia.org/T190323#4182848 (10ayounsi) 05Open>03Resolved [17:51:38] 10Operations, 10Traffic, 10netops: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4182849 (10ayounsi) [17:51:59] Join wikipedia-en [17:52:10] JOIN #wikipedia-en [17:53:39] you need to prefix with a / [17:53:43] e.g. 
[17:53:43] (03PS7) 10Dzahn: mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [17:53:51] /joing #wikipedia-e [17:53:53] /joing #wikipedia-en [17:53:57] /join #wikipedia-en [17:53:59] wow fail [17:54:03] rofl [17:54:06] Guest51630: --^ [17:54:13] (03CR) 10Hoo man: [C: 031] "Drives me crazy every single time…" [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [18:00:30] (03PS8) 10Dzahn: mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [18:00:43] (03CR) 10Dzahn: [C: 031] "no-change on terbium/wasat: http://puppet-compiler.wmflabs.org/11122/" [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:01:12] (03CR) 10Dzahn: [C: 032] mw-maintenance: add PHP7 support, php-readline version [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:01:17] (03CR) 10Faidon Liambotis: [C: 04-1] "Thousand times yes on the concept, but -1 for a few nitpicky details :)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [18:02:10] (03PS1) 10Ayounsi: Add Tata AS# to the Critical ASN list [puppet] - 10https://gerrit.wikimedia.org/r/430945 [18:02:23] (03PS2) 10Ayounsi: Add Tata AS# to the Critical ASN list [puppet] - 10https://gerrit.wikimedia.org/r/430945 [18:03:22] (03CR) 10Ayounsi: [C: 032] Add Tata AS# to the Critical ASN list [puppet] - 10https://gerrit.wikimedia.org/r/430945 (owner: 10Ayounsi) [18:03:45] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2189.codfw.wmnet [18:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:43] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2190.codfw.wmnet [18:05:45] 10Operations, 
10Traffic, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#4182875 (10BBlack) [18:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:16] (03PS4) 10Andrew Bogott: vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 [18:07:45] (03CR) 10Bstorm: "Are we ok with merging this one now with special attention the first time the code actually is invoked (which will be when a new wiki come" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [18:12:09] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2192.codfw.wmnet [18:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:31] ok, got sidetracked with pwstore issue [18:15:40] finishing up mediawiki-debian creation [18:16:08] !log mw2197,mw2198,mw2199 - reinstall with stretch [18:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:29] win 23 [18:21:00] robh: thanks, got the second email :) [18:22:07] (03CR) 10Marostegui: [C: 04-1] "I prefer if we don't enable notifications on db1116/db1118 as those hosts are not production." 
[puppet] - 10https://gerrit.wikimedia.org/r/430919 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [18:23:13] (03CR) 10Marostegui: [C: 031] "> Are we ok with merging this one now with special attention the" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [18:23:58] (03CR) 10Marostegui: "Probably best to depool it on Monday, as it is late in EU timezone and I wouldn't want to leave it depooled over the weekend" [puppet] - 10https://gerrit.wikimedia.org/r/430942 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [18:35:29] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2195.codfw.wmnet [18:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:00] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2194.codfw.wmnet [18:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:06] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2196.codfw.wmnet [18:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:46] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:49:35] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:51:47] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:05] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:58:14] !log sbisson@tin Started deploy [kartotherian/deploy@8e6b35b]: Use new keyspace (v4) for both i18n and non-i18n sources [18:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:10] !log sbisson@tin Finished deploy [kartotherian/deploy@8e6b35b]: Use new keyspace (v4) for both i18n and non-i18n sources (duration: 03m 57s) [19:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:22] PROBLEM - Apache HTTP on mw2197 is CRITICAL: connect to address 10.192.32.85 and port 80: Connection refused [19:04:01] PROBLEM - Check size of conntrack table on mw2197 is CRITICAL: Return code of 255 is out of bounds [19:04:01] PROBLEM - MD RAID on mw2197 is CRITICAL: Return code of 255 is out of bounds [19:05:02] ^me, donwtimed [19:41:07] (03PS3) 10Dzahn: tcpircbot: add mwmaint1001 to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) [19:41:33] (03CR) 10Dzahn: "mwmaint1001 now has the mapped address:" [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [19:45:15] (03PS1) 10Dzahn: add IPv6 records for mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430959 (https://phabricator.wikimedia.org/T192092) [19:45:23] (03CR) 10jerkins-bot: [V: 04-1] add IPv6 records for mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430959 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [19:45:38] (03PS2) 10Dzahn: add IPv6 records for mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430959 (https://phabricator.wikimedia.org/T192092) [19:49:52] (03CR) 10Bstorm: "> Probably best to depool it on Monday, as it is late in EU timezone" [puppet] - 10https://gerrit.wikimedia.org/r/430942 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [19:50:10] (03PS3) 10Bstorm: wiki replicas: add 
GRANT statement to $wiki_p database creation [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) [19:50:40] !log rebooting lvs1016 (downtimed, also new and not in service!) [19:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:13] (03CR) 10Bstorm: [C: 032] wiki replicas: add GRANT statement to $wiki_p database creation [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [19:55:28] (03CR) 10Bstorm: [C: 032] "It's in there now. We'll take a good look at what it does on the first replica I run it on." [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [20:00:51] (03CR) 10Dzahn: [C: 032] add IPv6 records for mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430959 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [20:04:10] (03CR) 10Dzahn: [C: 032] "v6 records added to DNS https://gerrit.wikimedia.org/r/#/c/430959/" [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [20:04:18] (03PS1) 10Hashar: (DO NOT SUBMIT) Test with builddep 'coreutils' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) [20:04:25] (03PS4) 10Dzahn: tcpircbot: add mwmaint1001 to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) [20:05:00] (03CR) 10jerkins-bot: [V: 04-1] (DO NOT SUBMIT) Test with builddep 'coreutils' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:05:59] (03Abandoned) 10Hashar: (DO NOT SUBMIT) Test with builddep 'coreutils' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:07:32] (03CR) 10Dzahn: network: add mwmaint1001 to 
network constants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [20:11:49] (03PS2) 10Dzahn: Revert "Revert "mwmaint1001: add mediawiki-maintenance role"" [puppet] - 10https://gerrit.wikimedia.org/r/430812 [20:12:01] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4183141 (10BBlack) The key to the ethool difference is this in the lspci stuff: ` Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-` vs ` Capabilities: [a0] MSI-X: Enable+ C... [20:14:25] (03PS1) 10Hashar: 0.18.4-wmf3: fix B90lintian hook not finding the package name. [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/430979 (https://phabricator.wikimedia.org/T193906) [20:18:41] (03Restored) 10Hashar: (DO NOT SUBMIT) Test with builddep 'coreutils' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:19:03] (03CR) 10Hashar: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:19:43] (03CR) 10jerkins-bot: [V: 04-1] (DO NOT SUBMIT) Test with builddep 'coreutils' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:19:58] (03Abandoned) 10Hashar: (DO NOT SUBMIT) Test with builddep 'coreutils' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430976 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:20:06] (03CR) 10Hashar: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [20:22:40] (03CR) 10Hashar: "debian-glue job solved as part of T193906 ;]" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 
(https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [20:25:52] (03CR) 10Hashar: [C: 032] 0.18.4-wmf3: fix B90lintian hook not finding the package name. [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/430979 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:26:27] (03Merged) 10jenkins-bot: 0.18.4-wmf3: fix B90lintian hook not finding the package name. [debs/jenkins-debian-glue] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/430979 (https://phabricator.wikimedia.org/T193906) (owner: 10Hashar) [20:42:10] 10Operations, 10Continuous-Integration-Infrastructure: Build and upload jenkins-debian-glue_0.18.4-wmf3 for jessie - https://phabricator.wikimedia.org/T193910#4183192 (10hashar) [20:43:43] 10Operations, 10Continuous-Integration-Infrastructure: Build and upload jenkins-debian-glue_0.18.4-wmf3 for jessie - https://phabricator.wikimedia.org/T193910#4183192 (10hashar) [20:45:15] 10Operations, 10Continuous-Integration-Infrastructure: Build and upload jenkins-debian-glue_0.18.4-wmf3 for jessie - https://phabricator.wikimedia.org/T193910#4183214 (10Dzahn) a:03Dzahn [20:49:13] (03PS3) 10Dzahn: Revert "Revert "mwmaint1001: add mediawiki-maintenance role"" [puppet] - 10https://gerrit.wikimedia.org/r/430812 [20:54:45] RECOVERY - Apache HTTP on mw2197 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [20:59:22] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#4183217 (10RobH) This seems fixed by adding the rootdelay for jessie and older, and stretch has it go away. If it happens on jessie, adding a rootdelay=15 to the initial boot post install fixes it. If... [21:00:41] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4183218 (10RobH) a:03Papaul @papaul: Please go ahead and process a warranty replacement for this disk with HP. 
If it is how swap (should be) we can replace without downtime. [21:08:02] 10Operations, 10ops-codfw: lvs2002 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148017#2712284 (10RobH) This is still doing this as of May 2nd on lvs2002 On the other task linked T148016 just had the issue resolve itself, with no notes of any corrective action. [21:08:08] (03CR) 10RobH: [C: 031] vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [21:49:51] 10Operations: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4183305 (10herron) >>! In T193766#4181267, @fgiunchedi wrote: > * Capacity - I chatted with @gehel at the last ops friday hangout about ELK and friends, it would be nice to get our feet wet with multiple indices instead of one sin... [21:55:10] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2197.codfw.wmnet [21:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:00] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2198.codfw.wmnet [22:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:44] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2199.codfw.wmnet [22:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:46] jouncebot: next [22:08:46] In 60 hour(s) and 51 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T1100) [22:09:01] updates a scap proxy in that case [22:39:27] (03PS1) 10Catrope: Enable ORES on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431035 (https://phabricator.wikimedia.org/T192498) [22:41:32] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#4183398 (10RobH) [22:41:34] 10Operations, 10ops-ulsfo: Multiple systems in ulsfo 
1.22 showing PSU failures - https://phabricator.wikimedia.org/T177622#4183396 (10RobH) 05Open>03Resolved a:03RobH [22:46:24] !log mw2191, mw2193 - wmf-auto-reimage with --no-verify because puppet certs didnt exist [22:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:32] (03PS1) 10Catrope: Enable ORES on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431036 (https://phabricator.wikimedia.org/T192501) [22:47:15] (03CR) 10Dzahn: [C: 032] Revert "Revert "mwmaint1001: add mediawiki-maintenance role"" [puppet] - 10https://gerrit.wikimedia.org/r/430812 (owner: 10Dzahn) [22:47:38] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431036 (https://phabricator.wikimedia.org/T192501) (owner: 10Catrope) [22:49:50] !log mw2187 - scap proxy - reinstalling with stretch [22:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:57] V-1 for it being Friday :p [22:50:13] !log mwmaint1001 - now using mw-maintenance role, upcoming terbium replacement [22:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:13] awight: hehe! 
(19715 | ERROR | [x] Expected 1 space after "=>"; 2 found) [22:51:34] /o\ [22:51:40] lint --pedantic [22:56:36] (03PS2) 10Dzahn: Move scap proxy in C4 to mw2188 [puppet] - 10https://gerrit.wikimedia.org/r/430918 (owner: 10Muehlenhoff) [22:57:11] (03CR) 10Dzahn: [C: 032] Move scap proxy in C4 to mw2188 [puppet] - 10https://gerrit.wikimedia.org/r/430918 (owner: 10Muehlenhoff) [22:58:36] (03PS1) 10Catrope: Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 [22:59:46] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 (owner: 10Catrope) [23:00:43] (03PS2) 10Catrope: Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 (https://phabricator.wikimedia.org/T192499) [23:01:58] PROBLEM - Nginx local proxy to apache on mw2187 is CRITICAL: connect to address 10.192.32.75 and port 443: Connection refused [23:01:59] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2187 is CRITICAL: Return code of 255 is out of bounds [23:02:18] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 (https://phabricator.wikimedia.org/T192499) (owner: 10Catrope) [23:04:04] RoanKattouw: it hates that you have 2 spaces after an "=>" somewhere [23:04:12] just to save you more if you are copy/pasting for more langs [23:04:13] Argh oops thanks [23:05:01] (03PS2) 10Catrope: Enable ORES on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431036 (https://phabricator.wikimedia.org/T192501) [23:05:03] (03PS3) 10Catrope: Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 (https://phabricator.wikimedia.org/T192499) [23:05:18] PROBLEM - mediawiki-installation DSH group on mwmaint1001 is CRITICAL: Host mwmaint1001 is not in mediawiki-installation dsh group [23:05:34] ^ that's me, it has the very first puppet run and is to replace terbium [23:06:12] (03CR) 10jerkins-bot: 
[V: 04-1] Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 (https://phabricator.wikimedia.org/T192499) (owner: 10Catrope) [23:06:28] i did not add it to scap hosts yet.. separate patch on purpose [23:08:43] (03PS2) 10Dzahn: prometheus_check_metric: print message when status is OK [puppet] - 10https://gerrit.wikimedia.org/r/430898 (https://phabricator.wikimedia.org/T193793) (owner: 10Filippo Giunchedi) [23:11:06] (03CR) 10Dzahn: [C: 032] prometheus_check_metric: print message when status is OK [puppet] - 10https://gerrit.wikimedia.org/r/430898 (https://phabricator.wikimedia.org/T193793) (owner: 10Filippo Giunchedi) [23:11:08] (03PS1) 10Dzahn: switch mw-maintenance server from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/431039 (https://phabricator.wikimedia.org/T192092) [23:14:31] (03PS1) 10Catrope: Enable ORES on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431040 (https://phabricator.wikimedia.org/T192496) [23:15:42] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431040 (https://phabricator.wikimedia.org/T192496) (owner: 10Catrope) [23:20:11] (03PS1) 10Dzahn: decom terbium: rm from scap,site,dhcp,network constants [puppet] - 10https://gerrit.wikimedia.org/r/431041 (https://phabricator.wikimedia.org/T192092) [23:21:54] (03PS1) 10Dzahn: mariadb: remove grants for terbium (do not merge) [puppet] - 10https://gerrit.wikimedia.org/r/431042 (https://phabricator.wikimedia.org/T192092) [23:24:36] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915#4183477 (10Dzahn) p:05Triage>03High [23:24:52] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915#4183477 (10Dzahn) p:05High>03Normal [23:26:44] (03PS2) 10Catrope: Enable ORES on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431040 
(https://phabricator.wikimedia.org/T192496) [23:27:55] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431040 (https://phabricator.wikimedia.org/T192496) (owner: 10Catrope) [23:29:01] 10Operations, 10Release-Engineering-Team (Watching / External): rename mira to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4183490 (10Dzahn) p:05Triage>03High [23:29:15] 10Operations, 10Release-Engineering-Team (Watching / External): rename mira to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4183490 (10Dzahn) p:05High>03Normal [23:29:41] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4183503 (10Dzahn) [23:31:03] 10Operations, 10monitoring, 10Patch-For-Review: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4183509 (10Dzahn) 05Open>03Resolved a:03Dzahn Thanks! merged and confirmed working. It shows the text on Icinga. [23:31:21] 10Operations, 10monitoring, 10Patch-For-Review: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4183512 (10Dzahn) a:05Dzahn>03fgiunchedi [23:32:02] PROBLEM - mediawiki-installation DSH group on mw2191 is CRITICAL: Host mw2191 is not in mediawiki-installation dsh group [23:32:15] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915#4183513 (10Dzahn) check if wasat hardware is new enough to just be renamed or whether it should also be replaced [23:40:41] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4183530 (10thcipriani) I created `deployment-deploy1001` as a stretch box. Here are my notes: Create new instance =================== Vi... 
[23:52:30] (03PS1) 10Dzahn: mw-maintenance: enable crons based on fqdn, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/431047 (https://phabricator.wikimedia.org/T192092) [23:58:33] (03CR) 10Dzahn: "works as intended in compiler: terbium and wasat no change, mwmaint1001: crons get disabled" [puppet] - 10https://gerrit.wikimedia.org/r/431047 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [23:59:36] (03CR) 10Dzahn: [C: 032] mw-maintenance: enable crons based on fqdn, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/431047 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)