[02:57:57] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.1) (duration: 08m 18s)
[02:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.70 seconds
[03:55:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 214.81 seconds
[05:23:40] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/429710 (https://phabricator.wikimedia.org/T190148)
[05:25:41] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/429710 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:26:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/429710 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:28:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 for alter table (duration: 01m 09s)
[05:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:48] !log Deploy schema change on db1074 - T191519 T188299 T190148
[05:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:53] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:28:53] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:28:54] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:30:10] (CR) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/429710 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:41:41] (PS1) Marostegui: s8.hosts: Add db1116:3318 to s8 [software] - https://gerrit.wikimedia.org/r/429711
(https://phabricator.wikimedia.org/T190704)
[05:43:47] (CR) Marostegui: [C: 2] s8.hosts: Add db1116:3318 to s8 [software] - https://gerrit.wikimedia.org/r/429711 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:44:33] (Merged) jenkins-bot: s8.hosts: Add db1116:3318 to s8 [software] - https://gerrit.wikimedia.org/r/429711 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:47:29] !log Drop table edit_page_tracking from s6 - T57385
[05:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:34] T57385: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385
[05:48:30] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#4167092 (Marostegui)
[05:50:43] !log Drop table edit_page_tracking from s4, s5 and s7 - T57385
[05:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:04] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#4167094 (Marostegui)
[05:57:35] (PS4) Elukey: role::analytics_cluster::hadoop::master: change the namenode's GC settings [puppet] - https://gerrit.wikimedia.org/r/429429 (https://phabricator.wikimedia.org/T193257)
[05:59:33] !log Drop table edit_page_tracking from s1 - T57385
[05:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:37] T57385: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385
[06:01:07] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#4167107 (Marostegui)
[06:04:56] !log
Drop table edit_page_tracking from s2 - T57385
[06:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:00] T57385: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385
[06:11:01] !log Drop table edit_page_tracking from s3 - T57385
[06:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:05] T57385: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385
[06:19:53] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#4167143 (Marostegui)
[06:20:19] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#589294 (Marostegui) Open>Resolved This has been dropped everywhere
[06:38:09] Operations, DBA, Chinese-Sites: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#551990 (Marostegui) After all the deletions that have happened lately as part of the parent ticket, this is the current status of these tables. They only exist on s3 on: ``` chw...
[06:41:11] (PS1) Vgutierrez: install_server: Reimage lvs3002 as stretch [puppet] - https://gerrit.wikimedia.org/r/429723 (https://phabricator.wikimedia.org/T191897)
[06:43:50] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs3002 as stretch [puppet] - https://gerrit.wikimedia.org/r/429723 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[06:48:03] !log Depool and reimage lvs3002 - T191897
[06:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:07] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[07:05:41] !log Temporary stop replication on db1095:s3
[07:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:18] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167254 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3002.esams.wmnet ``` The log can be found in `/var/lo...
[07:06:43] !log Restart replication on db1095:s3
[07:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:04] (PS1) ArielGlenn: remove all references to dataset1001 from rsync and related manifests [puppet] - https://gerrit.wikimedia.org/r/429724 (https://phabricator.wikimedia.org/T182540)
[07:18:55] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.136 second response time
[07:20:56] Operations, Traffic: Gather 24h data cluster wide of AES128-SHA usage - https://phabricator.wikimedia.org/T193376#4167263 (Vgutierrez) p:Triage>Normal
[07:24:02] (CR) Elukey: [C: 2] role::analytics_cluster::hadoop::master: change the namenode's GC settings [puppet] - https://gerrit.wikimedia.org/r/429429 (https://phabricator.wikimedia.org/T193257) (owner: Elukey)
[07:24:07] (PS5) Elukey: role::analytics_cluster::hadoop::master: change the namenode's GC settings [puppet] - https://gerrit.wikimedia.org/r/429429 (https://phabricator.wikimedia.org/T193257)
[07:27:04] (PS2) ArielGlenn: remove all references to dataset1001 from rsync and related manifests [puppet] - https://gerrit.wikimedia.org/r/429724 (https://phabricator.wikimedia.org/T182540)
[07:27:41] (CR) ArielGlenn: [C: 2] remove all references to dataset1001 from rsync and related manifests [puppet] - https://gerrit.wikimedia.org/r/429724 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[07:28:10] (PS1) Jcrespo: mariadb: Increase buffer pool for enwiki at dbstore1001 [puppet] - https://gerrit.wikimedia.org/r/429725
[07:28:31] (PS2) Jcrespo: mariadb: Increase buffer pool for enwiki at dbstore1001 [puppet] - https://gerrit.wikimedia.org/r/429725
[07:31:01] (CR) Jcrespo: [C: 2] mariadb: Increase buffer pool for enwiki at dbstore1001 [puppet] - https://gerrit.wikimedia.org/r/429725 (owner: Jcrespo)
[07:31:11]
!log restart HDFS namenode on analytics1002 (standby master) to pick up new JVM settings - T193257
[07:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:15] T193257: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257
[07:32:02] (PS1) ArielGlenn: stop rsync of web logs from dataset1001 [puppet] - https://gerrit.wikimedia.org/r/429726 (https://phabricator.wikimedia.org/T182540)
[07:33:32] !log restarting dbstore1001@s1 to apply config change
[07:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:39] Operations, Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#4167283 (fgiunchedi)
[07:34:23] (PS2) ArielGlenn: stop rsync of web logs from dataset1001 [puppet] - https://gerrit.wikimedia.org/r/429726 (https://phabricator.wikimedia.org/T182540)
[07:35:21] (CR) ArielGlenn: [C: 2] stop rsync of web logs from dataset1001 [puppet] - https://gerrit.wikimedia.org/r/429726 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[07:36:04] (PS2) Filippo Giunchedi: prometheus: define recording rules for k8s alerts [puppet] - https://gerrit.wikimedia.org/r/429415 (https://phabricator.wikimedia.org/T193186)
[07:40:43] (CR) Gilles: "@Imarlier if it's no longer needed, you can "abandon" the patch in gerrit's UI, which closes it properly."
[puppet] - https://gerrit.wikimedia.org/r/421981 (owner: Ori.livneh)
[07:41:07] (PS1) Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/429727
[07:43:30] (PS1) ArielGlenn: remove production roles from ms1001, dataset1001 [puppet] - https://gerrit.wikimedia.org/r/429728 (https://phabricator.wikimedia.org/T182540)
[07:44:52] (CR) Gilles: "Ah, Ori owns it, nevermind :D" [puppet] - https://gerrit.wikimedia.org/r/421981 (owner: Ori.livneh)
[07:45:16] (CR) Filippo Giunchedi: [C: 2] prometheus: define recording rules for k8s alerts [puppet] - https://gerrit.wikimedia.org/r/429415 (https://phabricator.wikimedia.org/T193186) (owner: Filippo Giunchedi)
[07:45:18] (CR) Gilles: "And it's already abandoned. Sigh, it's too early in the morning." [puppet] - https://gerrit.wikimedia.org/r/421981 (owner: Ori.livneh)
[07:45:55] * _joe_ hands gilles a cup of coffee
[07:45:57] <_joe_> :D
[07:46:00] right
[07:53:26] (PS4) Filippo Giunchedi: Add puppetization for mcrouter_exporter [puppet] - https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763)
[07:54:22] (CR) Filippo Giunchedi: [C: 2] Add puppetization for mcrouter_exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) (owner: Filippo Giunchedi)
[07:59:07] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167320 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3002.esams.wmnet'] ``` and were **ALL** successful.
[08:00:10] (PS2) Gehel: wdqs: enable UseNUMA on blazegraph and updater [puppet] - https://gerrit.wikimedia.org/r/429552 (https://phabricator.wikimedia.org/T193365)
[08:02:57] !log stopping replication on both db1090 db instances to finish maintenance
[08:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:00] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs3002 [puppet] - https://gerrit.wikimedia.org/r/429732 (https://phabricator.wikimedia.org/T191897)
[08:06:29] (CR) Vgutierrez: [C: 2] pybal: Re-enable BGP in lvs3002 [puppet] - https://gerrit.wikimedia.org/r/429732 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[08:11:22] (PS1) ArielGlenn: do tables dumps before stubs for enwiki [puppet] - https://gerrit.wikimedia.org/r/429733
[08:15:05] Operations, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370#4167353 (fgiunchedi)
[08:15:10] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar), and 2 others: Create a prometheus exporter for mcrouter - https://phabricator.wikimedia.org/T192763#4167351 (fgiunchedi) Open>Resolved Upstream has merged the changes I submitted, the Debian package...
[08:15:22] !log Repool (Re-enable BGP) in lvs3002 - T191897
[08:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:26] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[08:15:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 41.24, 36.43, 31.86
[08:16:04] !log force a manual failover of the HDFS Namenode from analytics1001 to analytics1002 to test new GC Settings - T193257
[08:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:10] T193257: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257
[08:18:39] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167357 (Vgutierrez)
[08:18:53] (PS2) ArielGlenn: do tables dumps before stubs for enwiki [puppet] - https://gerrit.wikimedia.org/r/429733
[08:19:22] (PS1) Jcrespo: mariadb: Repool both db1090 instances into s2 and s7 respectivelly [mediawiki-config] - https://gerrit.wikimedia.org/r/429736 (https://phabricator.wikimedia.org/T192979)
[08:20:21] (CR) ArielGlenn: [C: 2] do tables dumps before stubs for enwiki [puppet] - https://gerrit.wikimedia.org/r/429733 (owner: ArielGlenn)
[08:20:39] (PS1) Jcrespo: mariadb: Reenable notifications for db1090 before repooling it [puppet] - https://gerrit.wikimedia.org/r/429737 (https://phabricator.wikimedia.org/T192979)
[08:21:55] (Abandoned) Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/429727 (owner: Marostegui)
[08:23:15] (PS2) Volans: wmf-auto-reimage: verify BIOS boot parameters [puppet] - https://gerrit.wikimedia.org/r/429229
[08:23:17] (PS2) Volans: wmf-auto-reimage: allow to mask systemd services [puppet] - https://gerrit.wikimedia.org/r/429230
[08:23:19] (PS1) Volans: wmf-auto-reimage: increase timeout for Puppet cert
[puppet] - https://gerrit.wikimedia.org/r/429738
[08:23:21] !log swift eqiad-prod more weight to ms-be104[0-3] - T191896
[08:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:25] T191896: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896
[08:23:39] (CR) Volans: "@alex: comments addressed" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/429229 (owner: Volans)
[08:25:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 33.12, 32.56, 32.02
[08:26:55] (PS1) Marostegui: redact_sanitarium.sh: Remove db1069 [puppet] - https://gerrit.wikimedia.org/r/429739 (https://phabricator.wikimedia.org/T190704)
[08:27:51] (PS2) Marostegui: redact_sanitarium.sh: Remove db1069 [puppet] - https://gerrit.wikimedia.org/r/429739 (https://phabricator.wikimedia.org/T190704)
[08:29:27] (CR) Jcrespo: [C: 1] redact_sanitarium.sh: Remove db1069 [puppet] - https://gerrit.wikimedia.org/r/429739 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:29:33] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/429742
[08:29:41] (CR) Marostegui: [C: 2] redact_sanitarium.sh: Remove db1069 [puppet] - https://gerrit.wikimedia.org/r/429739 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:30:54] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/429742 (owner: Marostegui)
[08:32:10] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/429742 (owner: Marostegui)
[08:32:26] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/429742 (owner: Marostegui)
[08:33:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 after alter table (duration: 01m 00s)
[08:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:05] (CR) Volans: [C: -1] "I think this is changing the structure of the JSON, see comment inline." (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/429244 (owner: Muehlenhoff)
[08:35:20] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/429744 (https://phabricator.wikimedia.org/T190148)
[08:37:03] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/429744 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[08:38:15] !log restart HDFS namenode on analytics1001 (standby master) to pick up new JVM settings - T193257
[08:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:19] T193257: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257
[08:38:20] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/429744 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[08:38:35] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/429744 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[08:39:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 for alter table (duration: 00m 59s)
[08:39:34] !log Deploy schema change on db1076 - T191519 T188299 T190148
[08:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:39] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[08:39:39] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[08:39:39] T188299: Schema change for refactored actor storage
- https://phabricator.wikimedia.org/T188299
[08:40:04] (PS2) Jcrespo: mariadb: Reenable notifications for db1090 before repooling it [puppet] - https://gerrit.wikimedia.org/r/429737 (https://phabricator.wikimedia.org/T192979)
[08:41:28] (CR) Jcrespo: [C: 2] mariadb: Reenable notifications for db1090 before repooling it [puppet] - https://gerrit.wikimedia.org/r/429737 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:44:16] (PS2) Jcrespo: mariadb: Repool both db1090 instances into s2 and s7 respectivelly [mediawiki-config] - https://gerrit.wikimedia.org/r/429736 (https://phabricator.wikimedia.org/T192979)
[08:48:18] (PS1) Vgutierrez: install_server: Reimage lvs3001 as stretch [puppet] - https://gerrit.wikimedia.org/r/429745 (https://phabricator.wikimedia.org/T191897)
[08:49:51] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs3001 as stretch [puppet] - https://gerrit.wikimedia.org/r/429745 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[08:50:54] (CR) Jcrespo: [C: 2] mariadb: Repool both db1090 instances into s2 and s7 respectivelly [mediawiki-config] - https://gerrit.wikimedia.org/r/429736 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:52:14] (Merged) jenkins-bot: mariadb: Repool both db1090 instances into s2 and s7 respectivelly [mediawiki-config] - https://gerrit.wikimedia.org/r/429736 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:52:30] (CR) jenkins-bot: mariadb: Repool both db1090 instances into s2 and s7 respectivelly [mediawiki-config] - https://gerrit.wikimedia.org/r/429736 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:57:59] (PS1) Jcrespo: mariadb: Depool db1060, db1069 [mediawiki-config] - https://gerrit.wikimedia.org/r/429747 (https://phabricator.wikimedia.org/T186320)
[09:01:36] !log Depool and reimage lvs3001 as stretch - T191897
[09:01:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:40] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[09:03:10] (PS1) Alexandros Kosiaris: Reimage ganeti2008 as stretch [puppet] - https://gerrit.wikimedia.org/r/429748 (https://phabricator.wikimedia.org/T193121)
[09:03:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 (duration: 00m 59s)
[09:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:42] PROBLEM - ganeti-noded running on ganeti2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded
[09:03:52] PROBLEM - ganeti-mond running on ganeti2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond
[09:04:01] PROBLEM - ganeti-confd running on ganeti2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd
[09:04:26] (CR) Alexandros Kosiaris: [C: 2] Reimage ganeti2008 as stretch [puppet] - https://gerrit.wikimedia.org/r/429748 (https://phabricator.wikimedia.org/T193121) (owner: Alexandros Kosiaris)
[09:04:42] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[09:04:52] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090
[09:05:41] PROBLEM - PyBal connections to etcd on lvs3001 is CRITICAL: CRITICAL: 0 connections established with conf1003.eqiad.wmnet:2379 (min=4)
[09:06:28] ah ok Valentin is reimaging :D
[09:06:35] for a moment I was really worried :D
[09:06:58] (PS1) ArielGlenn: don't update rss feed file if feed for more recent output from job exists [dumps] - https://gerrit.wikimedia.org/r/429751
[09:07:39] my fault :)
[09:07:52] let me silence it
[09:09:23] (PS2) ArielGlenn: don't update rss feed file if feed for more recent output from job exists [dumps] -
https://gerrit.wikimedia.org/r/429751
[09:12:52] Operations, Analytics, Analytics-Kanban, EventBus, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4167452 (mobrovac)
[09:12:53] (CR) ArielGlenn: [C: 2] don't update rss feed file if feed for more recent output from job exists [dumps] - https://gerrit.wikimedia.org/r/429751 (owner: ArielGlenn)
[09:15:08] !log ariel@tin Started deploy [dumps/dumps@a6baf69]: do not update existing rss feed file if the dump job it covers is more recent than the one for which a feed is requested
[09:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:12] !log ariel@tin Finished deploy [dumps/dumps@a6baf69]: do not update existing rss feed file if the dump job it covers is more recent than the one for which a feed is requested (duration: 00m 04s)
[09:15:15] (CR) Filippo Giunchedi: wdqs: add standard prometheus JVM monitoring to blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: Gehel)
[09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:41] PROBLEM - puppet last run on ganeti2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:22:30] (CR) Alexandros Kosiaris: [C: 1] icinga: add notification type to SMS content and other improvements [puppet] - https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) (owner: Dzahn)
[09:23:23] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167455 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3001.esams.wmnet ``` The log can be found in `/var/lo...
[09:23:43] (CR) Marostegui: [C: 1] mariadb: Depool db1060, db1069 [mediawiki-config] - https://gerrit.wikimedia.org/r/429747 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[09:24:37] (PS2) Jcrespo: mariadb: Depool db1060, pool fully db1090 instances [mediawiki-config] - https://gerrit.wikimedia.org/r/429747 (https://phabricator.wikimedia.org/T186320)
[09:26:15] (CR) Jcrespo: [C: 2] mariadb: Depool db1060, pool fully db1090 instances [mediawiki-config] - https://gerrit.wikimedia.org/r/429747 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[09:27:28] (Merged) jenkins-bot: mariadb: Depool db1060, pool fully db1090 instances [mediawiki-config] - https://gerrit.wikimedia.org/r/429747 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[09:29:31] (CR) jenkins-bot: mariadb: Depool db1060, pool fully db1090 instances [mediawiki-config] - https://gerrit.wikimedia.org/r/429747 (https://phabricator.wikimedia.org/T186320) (owner: Jcrespo)
[09:32:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1060, fully pool db1090 (duration: 00m 59s)
[09:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:54] Operations, Availability (MediaWiki-MultiDC), Performance-Team (Radar): mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4167465 (Joe) After some consideration, I see three options moving forward: **Option A**: - Mcrouter is installed on one memcached host per row, where...
[09:46:45] Hallo. Is it possible to use JSON_ARRAYAGG in mysql on terbium?
[09:47:17] (If there's a better channel for database questions, please tell me.)
[09:47:57] !log restart HDFS Namenode on analytics1001 (current standby)
[09:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:06] (PS5) Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759)
[09:48:51] (CR) Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: Gehel)
[09:49:17] If I try it, I get: execute command denied to user 'research_prod'@'%' for routine 'log.JSON_ARRAYAGG'
[09:50:25] !log restart HDFS Namenode on analytics1001 (current standby) again with Xmx/Xms set to 8g
[09:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] What I need to do is to group by one column and to get all the results from another column in one row. I could do GROUP_CONCAT, but since I'm doing it with data that includes user input, I cannot join the values in a safe way with a separator. JSON looks safe, but apparently cannot be used.
[09:52:39] aharoni: #wikimedia-databases
[09:53:19] jynus: thanks, asking there
[09:56:34] (PS1) Elukey: role::analytics_cluster::hadoop::master|standby: set NN heap to 8G [puppet] - https://gerrit.wikimedia.org/r/429756 (https://phabricator.wikimedia.org/T193257)
[09:57:26] (CR) Elukey: [C: 2] role::analytics_cluster::hadoop::master|standby: set NN heap to 8G [puppet] - https://gerrit.wikimedia.org/r/429756 (https://phabricator.wikimedia.org/T193257) (owner: Elukey)
[10:00:45] !log set analytics1001 as active HDFS Namenode using manual failover
[10:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:13] PROBLEM - Disk space on lvs3001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
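Note on the [09:51:09] question (grouping rows while keeping user-supplied values separable): one possible workaround when JSON_ARRAYAGG is unavailable is to hex-encode each value before concatenating, since hex digits can never contain the separator. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the MariaDB host, and the table `events` and columns `grp` and `val` are hypothetical names that do not come from the log. On MariaDB the equivalent query would presumably use `GROUP_CONCAT(HEX(val) SEPARATOR ',')`, with the caveat that the result is truncated at `group_concat_max_len`.

```python
import sqlite3

# Hypothetical stand-in data: SQLite in memory instead of the MariaDB host,
# and an invented table "events" with columns "grp" and "val".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (grp TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO events (grp, val) VALUES (?, ?)",
    [
        ("a", "plain"),
        ("a", "has,comma"),
        ("b", 'quote " and | pipe'),
    ],
)

# Hex-encode each value before joining: hex digits never contain the
# separator, so user input cannot break the list apart.
rows = conn.execute(
    "SELECT grp, group_concat(hex(val), ',') FROM events GROUP BY grp"
).fetchall()


def decode_group(joined):
    """Split on the separator and turn each hex chunk back into text."""
    return [bytes.fromhex(chunk).decode("utf-8") for chunk in joined.split(",")]


# One entry per group, each holding the original values as a list.
result = {grp: decode_group(joined) for grp, joined in rows}
for grp in sorted(result):
    print(grp, sorted(result[grp]))
```

The order of values inside each group is not guaranteed by the query, so the example sorts them before printing. NULL values are skipped by the aggregate and would need separate handling.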
[10:02:13] PROBLEM - dhclient process on lvs3001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:02:28] ^^ that's me reimaging lvs3001
[10:03:13] RECOVERY - Disk space on lvs3001 is OK: DISK OK
[10:03:13] RECOVERY - dhclient process on lvs3001 is OK: PROCS OK: 0 processes with command name dhclient
[10:04:28] (PS1) Arturo Borrero Gonzalez: ruby: install libmysqlclient-dev package in the base image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/429758 (https://phabricator.wikimedia.org/T192566)
[10:06:20] !log restart hdfs namenode on analytics1002 to pick up new heap settings (last step of the maintenance)
[10:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:08] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167518 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3001.esams.wmnet'] ``` and were **ALL** successful.
[10:23:24] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: 9.878e+05 ge 1.728e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[10:25:47] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs3001 [puppet] - https://gerrit.wikimedia.org/r/429760 (https://phabricator.wikimedia.org/T191897)
[10:27:52] (CR) Vgutierrez: [C: 2] pybal: Re-enable BGP in lvs3001 [puppet] - https://gerrit.wikimedia.org/r/429760 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[10:30:29] !log Repool (Re-enable BGP) lvs3001 - T191897
[10:30:30] (CR) Filippo Giunchedi: [C: 1] icinga: add notification type to SMS content and other improvements [puppet] - https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) (owner: Dzahn)
[10:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:33] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[10:32:14] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4167568 (Vgutierrez)
[10:32:47] (PS1) ArielGlenn: cap dump rsyncs to peers at bandwidth of 40000K per second [puppet] - https://gerrit.wikimedia.org/r/429761 (https://phabricator.wikimedia.org/T191177)
[10:33:30] (CR) ArielGlenn: [C: 2] cap dump rsyncs to peers at bandwidth of 40000K per second [puppet] - https://gerrit.wikimedia.org/r/429761 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn)
[10:33:35] (PS2) ArielGlenn: cap dump rsyncs to peers at bandwidth of 40000K per second [puppet] - https://gerrit.wikimedia.org/r/429761 (https://phabricator.wikimedia.org/T191177)
[10:34:17] !log Updating puppet compiler facts
[10:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:36] (PS1) Ema: Ignore req.ttl when keeping track of expired objects [debs/varnish4]
(debian-wmf) - https://gerrit.wikimedia.org/r/429762
[10:40:38] (PS1) Urbanecm: Enable on Marathi Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371)
[10:42:40] (PS3) Ema: Introduce ttl_now and the new way of calculating TTLs in VCL [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/429440
[10:44:30] (PS1) Vgutierrez: hieradata: clean-up esams lvs configuration [puppet] - https://gerrit.wikimedia.org/r/429764 (https://phabricator.wikimedia.org/T191897)
[10:48:13] (CR) Vgutierrez: [C: 2] "pcc is happy and shows noop: https://puppet-compiler.wmflabs.org/compiler02/11072/" [puppet] - https://gerrit.wikimedia.org/r/429764 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[11:00:05] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T1100). Please do the needful.
[11:12:50] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team, Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4167675 (MarcoAurelio) I think @mobrovac knows about Kafka as well, and helped the last week resolve an issue on the cpjobqueue se...
[11:49:16] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1956 bytes in 0.105 second response time
[11:53:29] (PS6) Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - https://gerrit.wikimedia.org/r/428390
[11:58:20] (CR) Urbanecm: [C: -1] "Do not deploy. See the linked task."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [12:18:05] (03PS1) 10Volans: Add entries for ganeti instances for DebMonitor [dns] - 10https://gerrit.wikimedia.org/r/429780 (https://phabricator.wikimedia.org/T191299) [12:30:58] (03PS1) 10Alexandros Kosiaris: ganeti: Default hiera call to empty array [puppet] - 10https://gerrit.wikimedia.org/r/429784 [12:31:54] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: Default hiera call to empty array [puppet] - 10https://gerrit.wikimedia.org/r/429784 (owner: 10Alexandros Kosiaris) [12:37:17] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4167810 (10Joe) >>! In T192771#4167477, @Volans wrote: > I've an additional question, what is the expected behaviour in the following failure scenarios fo... [12:38:46] PROBLEM - configured eth on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:38:54] _joe_: thanks, and I guess it will require manual depooling right? [12:39:41] <_joe_> volans: or not, we can choose [12:40:05] would nutcracker be able to automatically detect the failed mcrouter instace? 
[12:40:09] *instance [12:40:26] PROBLEM - Check size of conntrack table on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:40:26] PROBLEM - dhclient process on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:42:06] PROBLEM - Check systemd state on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:42:06] PROBLEM - ganeti-confd running on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:43:49] (03CR) 10Giuseppe Lavagetto: [C: 031] Add entries for ganeti instances for DebMonitor [dns] - 10https://gerrit.wikimedia.org/r/429780 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:43:56] PROBLEM - Check the NTP synchronisation status of timesyncd on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:43:56] PROBLEM - ganeti-mond running on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:44:21] <_joe_> I guess this is ganeti defending itself from volans [12:44:31] ahahah AFAIK it was reimaged [12:45:25] (03CR) 10Volans: [C: 032] Add entries for ganeti instances for DebMonitor [dns] - 10https://gerrit.wikimedia.org/r/429780 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:45:36] PROBLEM - Check whether ferm is active by checking the default input chain on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:45:36] PROBLEM - ganeti-noded running on ganeti2008 is CRITICAL: Return code of 255 is out of bounds [12:46:38] akosiaris: is it you re-imaging it? ^^^ [12:47:16] yes [12:47:16] PROBLEM - DPKG on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:47:16] PROBLEM - puppet last run on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:47:51] the puppet profile had an error so it never happened automagically [12:47:58] akosiaris: should I hold performing ganeti admin actions on the codfw cluster? 
(create new instance) [12:48:04] no [12:48:11] I 've fully depooled it [12:48:16] ganeti2008 that is [12:48:18] ack, thanks [12:48:37] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [12:48:57] PROBLEM - Disk space on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:48:57] !log aborrero@labtestnet2001:~ $ sudo rm /var/log/upstart/nova-api.log.1 <--- disk full, logrotate refuses to work bc that [12:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:06] PROBLEM - ganeti-confd running on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:49:07] PROBLEM - Check systemd state on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:49:13] 10Operations, 10Patch-For-Review: Upgrade ganeti hosts to stretch - https://phabricator.wikimedia.org/T193121#4167815 (10akosiaris) [12:49:16] PROBLEM - DPKG on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:49:26] PROBLEM - Check size of conntrack table on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:49:26] PROBLEM - dhclient process on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:49:36] 10Operations, 10Patch-For-Review: Upgrade ganeti hosts to stretch - https://phabricator.wikimedia.org/T193121#4160393 (10akosiaris) ganeti2008 has been emptied and reimaged as stretch. [12:49:56] PROBLEM - IPMI Sensor Status on ganeti2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:51:33] akosiaris: if 'automagically' means with the wmf-auto-reimage script it can be re-launched with options to skip the already done steps after the fix ;) [12:53:03] volans: it was just the puppet agent -t run, I did it manually [12:53:08] I am now evaluating [12:53:16] and of course megacli is not present on stretch [12:53:19] problem #1 met [12:54:13] the script does few more things for you after that , but not too many so either way :) [12:54:46] RECOVERY - configured eth on ganeti2008 is OK: OK - interfaces up [12:55:06] RECOVERY - Disk space on ganeti2008 is OK: DISK OK [12:55:16] RECOVERY - Check systemd state on ganeti2008 is OK: OK - running: The system is fully operational [12:55:17] RECOVERY - DPKG on ganeti2008 is OK: All packages OK [12:55:26] RECOVERY - Check size of conntrack table on ganeti2008 is OK: OK: nf_conntrack is 0 % full [12:55:27] RECOVERY - dhclient process on ganeti2008 is OK: PROCS OK: 0 processes with command name dhclient [12:55:36] RECOVERY - Check whether ferm is active by checking the default input chain on ganeti2008 is OK: OK ferm input default policy is set [12:56:20] why is it even alerting ? it's on a scheduled downtime [12:56:42] it was, then when we do node deactivate and puppet runs on icinga it disappear [12:57:00] start time 2018-04-30 12:50:47 [12:57:02] weird [12:57:19] then it reappears again, the script does a downtime again after the first puppet run but the first run is too long [12:57:38] no, I 've submitted that scheduled downtime [12:57:45] when? [12:57:52] all the checks were there already? [12:57:54] at 2018-04-30 12:50:47 [12:57:59] looks like it [12:58:14] the downtime host and all checks affect only the existing ones, not new ones added later as I'm sure you know [12:58:26] they are listed as under downtime in the UI [12:58:33] oh wait, you mean the recoveries? [12:58:35] yes [12:58:40] those ignore the downtime, because it's a recovery [12:58:45] what ? 
[12:58:47] and icinga is happy to let you know that [12:58:49] really ? [12:59:00] how come I did not know that ? [12:59:11] are you sure ? [12:59:14] I don't remember that [12:59:25] either that or I have some very bad memory loss [12:59:28] I'm double checking, but by memory I think this is the usual behaviour [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T1300). [13:00:04] davidwbarratt, Urbanecm, and Daimona: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:57] here! [13:01:10] I can SWAT today! [13:01:14] Hay [13:01:25] *hey [13:01:57] pre-flight checks completed, looks like clear skies for swat today [13:02:34] A little note: my patch is indeed testable, but needs someone to be blocked for heavy swearing :D Since I'd rather not do it myself, we'll have to wait for a vandalism to confirm it's working [13:02:36] davidwbarratt: reviewing your patch, will ping you in a few minutes when it's ready for testing at mwdebug1002; do you know how to test there? [13:02:54] PROBLEM - MD RAID on wasat is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 [13:02:55] ACKNOWLEDGEMENT - MD RAID on wasat is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T193394 [13:03:00] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4167825 (10ops-monitoring-bot) [13:03:00] Daimona: so I should just deploy it, no testing at mwdebug? 
[13:03:04] zeljkof yes I do, but the feature hasn't been deployed yet, so there isn't really anything to test [13:03:07] Urbanecm: around for swat? [13:03:12] yes [13:03:19] akosiaris: to be clear I think this is the behaviour of a host/service that was already in alarm once the downtime was set [13:03:23] i.e. we are disabling it before the train goes out [13:03:27] Didn't noticed jouncebot asked me for swat [13:03:38] zeljkof: Yeah, deploy straight away and I'll check if everything goes fine [13:03:39] davidwbarratt: ok, so no point in deploying to mwdebug, I can deploy to production directly? [13:03:40] volans: yes it was [13:03:41] But I had network issues in past minutes, so probably I really didn't recieved it. [13:03:46] But I'm here zeljkof :) [13:03:52] zeljkof if the patch looks good to you, yes. :) [13:04:12] davidwbarratt: ok, reviewing [13:04:14] RECOVERY - IPMI Sensor Status on ganeti2008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [13:04:44] RECOVERY - Check the NTP synchronisation status of timesyncd on ganeti2008 is OK: OK: synced at Mon 2018-04-30 13:04:35 UTC. [13:05:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt) [13:07:09] (03Merged) 10jenkins-bot: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt) [13:07:20] davidwbarratt: I forgot to ask, you know you can deploy your change yourself (for future reference)? [13:07:43] zeljkof uhh really? [13:08:02] if you are a deployer... 
:D [13:08:04] davidwbarratt: well, you have to be a deployer [13:08:17] but if you have to deploy regularly, that is doable [13:09:12] akosiaris: see https://tracker.nagios.org/view.php?id=294 and related [13:09:45] (03CR) 10jenkins-bot: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt) [13:09:47] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428854|Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia (T192962)]] (duration: 01m 00s) [13:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:51] I know it's for nagios... but I guess they kept the behaviour ;) [13:09:51] T192962: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia - https://phabricator.wikimedia.org/T192962 [13:10:16] davidwbarratt: your patch is deployed, please test and thanks for deploying with #releng ;) [13:11:05] !log reimage analytics1049 and 1050 to Debian Stretch [13:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:25] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429442 (https://phabricator.wikimedia.org/T193242) (owner: 10Urbanecm) [13:11:27] zeljkof thanks! 
[13:11:38] (03PS2) 10Zfilipin: Enable RCPatrol in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429442 (https://phabricator.wikimedia.org/T193242) (owner: 10Urbanecm) [13:11:46] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429442 (https://phabricator.wikimedia.org/T193242) (owner: 10Urbanecm) [13:11:54] volans: yeah, The reasoning still stands from issue 0000380 and issue 0000035; Problem notifications that happen outside a downtime window will still trigger a recovery notification inside that window. [13:12:04] I honestly do not remember that [13:12:13] or I may have just delete it from my memory [13:12:15] RECOVERY - puppet last run on ganeti2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:12:23] I find it impossible that I have not met this already [13:12:43] rotfl [13:12:59] !log restarting elasticsearch codfw rolling restart for plugin update and NUMA config - T191543 / T191236 [13:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:03] T191543: Deploy updated search/extra plugin with Slovak Stemmer - https://phabricator.wikimedia.org/T191543 [13:13:04] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236 [13:13:11] (03Merged) 10jenkins-bot: Enable RCPatrol in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429442 (https://phabricator.wikimedia.org/T193242) (owner: 10Urbanecm) [13:13:18] mobrovac, elukey: ^^ [13:13:49] ok, megacli installed, everything looks fine [13:13:53] well, nothing looks broken from the config change, so I'm going to say it was successful. 
[13:14:17] gehel: cc ottomata too [13:14:22] thanks for the heads up [13:14:45] Urbanecm: 429442 is at mwdebug [13:14:53] cool [13:15:06] Ok [13:15:36] (03CR) 10jenkins-bot: Enable RCPatrol in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429442 (https://phabricator.wikimedia.org/T193242) (owner: 10Urbanecm) [13:16:48] !log Drop unusued _old tables from a few wikis - https://phabricator.wikimedia.org/T54932#4167221 [13:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:21] Daimona: still waiting for CI for your patch... [13:17:24] zeljkof, patch is working [13:17:26] please deploy [13:17:31] Yeah, I'm looking at it [13:17:31] Urbanecm: ok, deploying [13:17:34] thank you [13:17:39] It's taking ages as usual [13:17:41] (03CR) 10Ottomata: [C: 031] "One typo I think but +1!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [13:18:13] 10Operations, 10DBA, 10Chinese-Sites: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#4167876 (10Marostegui) 05Open>03Resolved a:03Marostegui >>! In T54932#4167221, @Marostegui wrote: > After all the deletions that have happened lately as part of the parent ti... [13:18:20] Daimona: I have merged it at the beginning of swat because I knew it would take a lot of time :) [13:18:39] Indeed :D [13:18:40] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:429442|Enable RCPatrol in cswiki (T193242)]] (duration: 00m 59s) [13:18:42] Urbanecm: 429442 is deployed [13:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:47] T193242: Enable $wgUseRCPatrol in cswiki - https://phabricator.wikimedia.org/T193242 [13:18:51] zeljkof, thank you! 
[13:19:30] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:19:36] (03CR) 10Zfilipin: Enable flood flag on sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:19:40] (03PS2) 10Zfilipin: Enable flood flag on sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:19:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:19:50] (03PS1) 10Volans: DHCP/netboot: add entries for debmonitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/429798 (https://phabricator.wikimedia.org/T191299) [13:21:08] !log beginning rolling reimage of kafka200[23] to stretch T192832 [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] T192832: Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832 [13:21:15] (03Merged) 10jenkins-bot: Enable flood flag on sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:21:46] Woah, finally [13:22:04] Urbanecm: 429668 is at mwdebug [13:22:14] (03CR) 10jenkins-bot: Enable flood flag on sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429668 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:22:16] (03CR) 10Volans: [C: 032] DHCP/netboot: add entries for debmonitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/429798 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:22:17] Daimona: just in time, I am finishing other deployments [13:22:30] Nice [13:22:53] Now, after deployment, I'll wait for a vandalism and will let you know if there is 
any problem [13:23:09] ack [13:24:11] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429799 [13:24:20] zeljkof, I need to do a follow-up [13:24:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429799 (owner: 10Marostegui) [13:24:28] Urbanecm: ok, [13:24:34] (03PS1) 10Jcrespo: mariadb: Move db1069 from s7 to x1 (while still full depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) [13:24:38] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429799 (owner: 10Marostegui) [13:25:02] (03PS2) 10Jcrespo: mariadb: Move db1069 from s7 to x1 (while still fully depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) [13:26:04] !log Stop MySQL on db1098 - T193331 [13:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:08] T193331: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331 [13:26:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka2001 is CRITICAL: 44.83 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2001 [13:27:27] (03PS1) 10Urbanecm: Allow bureaucrats to remove flood group for real, allow flooders to strip the group from them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429801 (https://phabricator.wikimedia.org/T193350) [13:27:36] zeljkof, please review&merge ^^^^ [13:27:44] 429801 in particular [13:28:02] Urbanecm: will do, just to deploy 429570 [13:28:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka2003 is CRITICAL: 36.4 ge 10 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2003 [13:28:05] oh neeeded to silence that alert [13:28:06] sorry [13:28:08] so the Kafka broker alarm should be due to an upcoming reimage [13:28:10] will add to procedure [13:28:11] ah yes :) [13:28:18] i did it last week but forgot to add! [13:29:21] zeljkof, ack [13:29:39] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka2001 is CRITICAL: 53.79 ge 10 ottomata T192832 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2001 [13:29:39] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on kafka2003 is CRITICAL: 45.67 ge 10 ottomata T192832 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2003 [13:30:17] ottomata / elukey: were those kafka alerts related to my elasticsearch cluster restart? 
[13:30:17] !log Poweroff db1098 for HW maintenance - T193331 [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:29] !log zfilipin@tin Synchronized php-1.32.0-wmf.1/extensions/AbuseFilter: SWAT: [[gerrit:429570|Dont use an empty string for block parameters (T189681)]] (duration: 01m 02s) [13:30:29] gehel: no [13:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:33] T189681: Actions are marked as red even if unchanged - https://phabricator.wikimedia.org/T189681 [13:30:37] Daimona: 429570 is deployed, please test and thanks for deploying with #releng ;) [13:30:38] i'm reimaging codfw brokers to stretch [13:30:41] i should ahve silenced those [13:30:46] (03PS1) 10ArielGlenn: provide all scripts for dumps misc crons on testbed host [puppet] - 10https://gerrit.wikimedia.org/r/429802 (https://phabricator.wikimedia.org/T161509) [13:30:54] those are the other brokers alerting that the a replica is missing [13:30:55] zeljkof thanks! I'll let you know [13:31:12] ottomata: Ok, thanks! I did not understand how there would be a link, so happy to see there isn't :) [13:31:17] (03CR) 10jerkins-bot: [V: 04-1] provide all scripts for dumps misc crons on testbed host [puppet] - 10https://gerrit.wikimedia.org/r/429802 (https://phabricator.wikimedia.org/T161509) (owner: 10ArielGlenn) [13:31:25] Daimona: hm, just notice this, not sure if it was around earlier: `Undefined index: 0 in /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php on line 1498` [13:31:34] Yeah [13:31:41] There's a separate patch for that [13:31:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 37.70, 34.39, 32.39 [13:31:48] Which still needs to be merged in master [13:31:51] Daimona: ok [13:32:08] Urbanecm: did you add patches to calendar? 
[13:32:15] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429803 [13:32:16] Not the followup, should I? [13:32:51] (03PS1) 10Jcrespo: mariadb: Move db1069 from s7 to x1 and enable its reimage [puppet] - 10https://gerrit.wikimedia.org/r/429804 (https://phabricator.wikimedia.org/T186320) [13:32:52] Urbanecm: please do [13:33:05] zeljkof, as you wish: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1789819&oldid=1789808 [13:34:00] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429801 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:34:04] Preliminar checks look good so I think everything is fine [13:35:01] (03PS7) 10Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 [13:35:14] (03Merged) 10jenkins-bot: Allow bureaucrats to remove flood group for real, allow flooders to strip the group from them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429801 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:35:57] (03PS2) 10ArielGlenn: provide all scripts for dumps misc crons on testbed host [puppet] - 10https://gerrit.wikimedia.org/r/429802 (https://phabricator.wikimedia.org/T161509) [13:36:06] Urbanecm: thanks, it's easier for me to have patches in one place, 429801 is at mwdebug [13:36:16] I finally understand [13:36:33] Is the follow-up patch at mwdebug together with the patch before? [13:36:43] *totally [13:36:47] Wrong word :D [13:37:35] Urbanecm: yes, both patches are merged into master and mwdebug1002 is synced to include both [13:37:48] Ok. They work, please deploy them. 
[13:37:54] Urbanecm: deploying [13:37:58] ack [13:39:04] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:429801|Allow bureaucrats to remove flood group for real, allow flooders to strip the group from them (T193350)]] (duration: 00m 59s) [13:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:08] T193350: Activation of flood flag on www.wikisource.org - https://phabricator.wikimedia.org/T193350 [13:39:17] Urbanecm: deployed [13:39:23] thank you! [13:39:41] (03PS2) 10Jcrespo: mariadb: Move db1069 from s7 to x1 and enable its reimage [puppet] - 10https://gerrit.wikimedia.org/r/429804 (https://phabricator.wikimedia.org/T186320) [13:40:30] nothing more for swat, so... [13:40:34] (03CR) 10Marostegui: [C: 031] mariadb: Move db1069 from s7 to x1 and enable its reimage [puppet] - 10https://gerrit.wikimedia.org/r/429804 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [13:40:35] !log EU SWAT finished [13:40:37] (03CR) 10jenkins-bot: Allow bureaucrats to remove flood group for real, allow flooders to strip the group from them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429801 (https://phabricator.wikimedia.org/T193350) (owner: 10Urbanecm) [13:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429803 (owner: 10Marostegui) [13:42:04] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429803 (owner: 10Marostegui) [13:42:07] (03PS3) 10Jcrespo: mariadb: Move db1069 from s7 to x1 (while still fully depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) [13:42:08] (03PS1) 10Jcrespo: mariadb: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429805 (https://phabricator.wikimedia.org/T186320) [13:42:45] 
(03CR) 10Jcrespo: [C: 032] mariadb: Move db1069 from s7 to x1 and enable its reimage [puppet] - 10https://gerrit.wikimedia.org/r/429804 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [13:43:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 after alter table (duration: 00m 59s) [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:52] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 38.66, 33.24, 32.19 [13:44:23] (03PS3) 10ArielGlenn: provide all scripts for dumps misc crons on testbed host [puppet] - 10https://gerrit.wikimedia.org/r/429802 (https://phabricator.wikimedia.org/T161509) [13:46:32] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429803 (owner: 10Marostegui) [13:49:11] PROBLEM - Check size of conntrack table on analytics1050 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [13:49:26] (03PS4) 10ArielGlenn: provide all scripts for dumps misc crons on testbed host [puppet] - 10https://gerrit.wikimedia.org/r/429802 (https://phabricator.wikimedia.org/T161509) [13:50:05] (03CR) 10ArielGlenn: [C: 032] provide all scripts for dumps misc crons on testbed host [puppet] - 10https://gerrit.wikimedia.org/r/429802 (https://phabricator.wikimedia.org/T161509) (owner: 10ArielGlenn) [13:52:08] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: deploy the service in production - https://phabricator.wikimedia.org/T191299#4167977 (10Volans) Setup DNS, DHCP, netboot and created 2 VMs on Ganeti: `debmonitor[12]001`. 
[13:54:26] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1050 is CRITICAL: NRPE: Command check_ferm_active not defined [13:55:17] PROBLEM - Check whether ferm is active by checking the default input chain on ganeti2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [13:57:17] RECOVERY - Check whether ferm is active by checking the default input chain on ganeti2008 is OK: OK ferm input default policy is set [13:57:43] (03PS6) 10Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) [14:00:07] (03CR) 10Gehel: [C: 032] wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel) [14:02:57] PROBLEM - Hadoop DataNode on analytics1050 is CRITICAL: NRPE: Command check_hadoop-hdfs-datanode not defined [14:03:22] reimaging --^ [14:17:38] (03PS4) 10Jcrespo: mariadb: Move db1069 from s7 to x1 (while still fully depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) [14:17:40] (03PS2) 10Jcrespo: mariadb: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429805 (https://phabricator.wikimedia.org/T186320) [14:17:46] !log rolling restart blazegraph on all wdqs nodes for new configuration - T192759 [14:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:05] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#4168039 (10fgiunchedi) I've put together a sample dashboard to play around with some concepts/ideas emerged in this task at https://grafana.wikimedia.org/dashboard/db/dash... 
[14:18:56] (03PS1) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) [14:19:02] (03CR) 10Jcrespo: [C: 032] Add mysql.py wrapper [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/429654 (owner: 10Jcrespo) [14:19:16] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1069 from s7 to x1 (while still fully depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:19:25] (03CR) 10jerkins-bot: [V: 04-1] varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:20:35] (03Merged) 10jenkins-bot: mariadb: Move db1069 from s7 to x1 (while still fully depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:20:51] (03CR) 10jenkins-bot: mariadb: Move db1069 from s7 to x1 (while still fully depooled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429800 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:21:09] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429805 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:21:32] (03PS4) 10Elukey: role::druid::analytics::worker: upgrade Druid to 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [14:22:35] (03Merged) 10jenkins-bot: mariadb: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429805 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:23:22] (03PS1) 10Jcrespo: mysql80: Add first stable mysql package [software] - 10https://gerrit.wikimedia.org/r/429811 (https://phabricator.wikimedia.org/T193226) [14:25:07] (03PS2) 10Jcrespo: mysql80: 
Add first stable mysql package [software] - 10https://gerrit.wikimedia.org/r/429811 (https://phabricator.wikimedia.org/T193226) [14:25:09] (03PS1) 10Jcrespo: mariadb: Move db1069 from s7 to x1 [software] - 10https://gerrit.wikimedia.org/r/429812 (https://phabricator.wikimedia.org/T186320) [14:25:10] RECOVERY - Hadoop DataNode on analytics1050 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:25:29] RECOVERY - Check size of conntrack table on analytics1050 is OK: OK: nf_conntrack is 0 % full [14:25:39] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1050 is OK: OK ferm input default policy is set [14:25:42] (03CR) 10Jcrespo: [V: 032 C: 032] mysql80: Add first stable mysql package [software] - 10https://gerrit.wikimedia.org/r/429811 (https://phabricator.wikimedia.org/T193226) (owner: 10Jcrespo) [14:26:05] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1069 from s7 to x1 [software] - 10https://gerrit.wikimedia.org/r/429812 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:26:34] !log Power off db2081 for HW maintenance - T193325 [14:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:38] T193325: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325 [14:26:58] (03CR) 10jenkins-bot: mariadb: Depool db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429805 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [14:27:16] (03CR) 10Elukey: [C: 032] role::druid::analytics::worker: upgrade Druid to 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [14:27:44] !log upgrade druid on druid100[1-3] from 0.9.2 to 0.10 [14:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Move db1069 from s7 to x1, depool 
db1056 (duration: 00m 59s) [14:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:37] !log shutting down db1056 for upgrade/maintenance and cloning [14:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:53] (03PS2) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) [14:33:21] (03CR) 10jerkins-bot: [V: 04-1] varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:36:21] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/429240 (owner: 10Herron) [14:36:23] (03PS3) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) [14:37:04] (03PS1) 10Cmjohnson: Removing dns entries for db1039 [dns] - 10https://gerrit.wikimedia.org/r/429814 (https://phabricator.wikimedia.org/T184262) [14:38:18] PROBLEM - Host db2081.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:13] ^ok, maintenance in process [14:40:28] yeah, expected :) [14:41:56] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4168094 (10chasemp) >>! In T193196#4164325, @Cmjohnson wrote: > @chasemp Confirmed both are 10G w/2 nics, labnet1004 can go to B2...I do not currently have any labnet ser... 
[14:46:10] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for db1039 [dns] - 10https://gerrit.wikimedia.org/r/429814 (https://phabricator.wikimedia.org/T184262) (owner: 10Cmjohnson) [14:47:09] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1039 - https://phabricator.wikimedia.org/T184262#4168111 (10Cmjohnson) [14:48:48] RECOVERY - Host db2081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [14:49:35] (03PS1) 10Marostegui: db2081: Reenable notifications [puppet] - 10https://gerrit.wikimedia.org/r/429816 (https://phabricator.wikimedia.org/T193325) [14:49:54] (03CR) 10Marostegui: [C: 04-1] "Wait till maintenance is finished" [puppet] - 10https://gerrit.wikimedia.org/r/429816 (https://phabricator.wikimedia.org/T193325) (owner: 10Marostegui) [14:53:09] (03PS1) 10Volans: Debmonitor: add dummy MySQL password to hiera [labs/private] - 10https://gerrit.wikimedia.org/r/429818 (https://phabricator.wikimedia.org/T192875) [14:55:04] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4168131 (10Papaul) switch port information asw-a6-codfw ge-6/0/10 [14:56:47] (03CR) 10Volans: [V: 032 C: 032] "Equivalent of the private repo one." [labs/private] - 10https://gerrit.wikimedia.org/r/429818 (https://phabricator.wikimedia.org/T192875) (owner: 10Volans) [14:58:55] (03PS2) 10Filippo Giunchedi: k8s: simplify prometheus alerts with recording rules [puppet] - 10https://gerrit.wikimedia.org/r/429416 (https://phabricator.wikimedia.org/T193186) [15:02:46] 10Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3077335 (10herron) Hi @bbogaert For sure, here is the ldap config used by the MX servers https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/templates/exi... 
[15:04:13] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164229 (10Pchelolo) > @mobrovac do you know if LocalRenameUserJob jobs on meta (and only there... [15:04:33] (03PS8) 10Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 [15:04:51] (03PS1) 10Cmjohnson: Removing dns entry wmf3565 [dns] - 10https://gerrit.wikimedia.org/r/429820 (https://phabricator.wikimedia.org/T190225) [15:05:49] (03CR) 10Cmjohnson: [C: 032] Removing dns entry wmf3565 [dns] - 10https://gerrit.wikimedia.org/r/429820 (https://phabricator.wikimedia.org/T190225) (owner: 10Cmjohnson) [15:06:24] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168176 (10alanajjar) >>! In T193254#4165761, @1997kB wrote: > [[https://meta.wikimedia.org/wik... [15:07:21] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission unused host wmf3565 - https://phabricator.wikimedia.org/T190225#4168184 (10Cmjohnson) [15:09:26] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4168193 (10Vgutierrez) After running several small captures (10 minutes lapses over 2 days), we've got the following results: * 56% MiTM victims * 32% deprecated human-operat... [15:09:53] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4168195 (10Cmjohnson) [15:10:12] PROBLEM - configured eth on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:21] PROBLEM - nutcracker port on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:10:31] PROBLEM - Check systemd state on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:10:57] being reimaged [15:11:01] PROBLEM - Check size of conntrack table on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:11:01] PROBLEM - dhclient process on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:11:01] PROBLEM - Check whether ferm is active by checking the default input chain on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:11:01] PROBLEM - nutcracker process on snapshot1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:13:13] (03PS4) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) [15:13:23] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler02/11075/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [15:15:33] (03PS3) 10Gehel: wdqs: enable UseNUMA on blazegraph and updater [puppet] - 10https://gerrit.wikimedia.org/r/429552 (https://phabricator.wikimedia.org/T193365) [15:16:14] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4168203 (10Papaul) a:05Papaul>03Marostegui 1- power disconnection+connection 2- update BIOS from 2.5.5 to 2.7.1 3- update IDRAC from 2.50 to 2.52 [15:16:21] RECOVERY - nutcracker port on snapshot1006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:16:21] RECOVERY - configured eth on snapshot1006 is OK: OK - interfaces up [15:16:32] RECOVERY - Check systemd state on snapshot1006 is OK: OK - running: The system is fully operational [15:17:01] RECOVERY - Check whether ferm is active by checking the default 
input chain on snapshot1006 is OK: OK ferm input default policy is set [15:17:01] RECOVERY - Check size of conntrack table on snapshot1006 is OK: OK: nf_conntrack is 0 % full [15:17:02] RECOVERY - dhclient process on snapshot1006 is OK: PROCS OK: 0 processes with command name dhclient [15:17:02] RECOVERY - nutcracker process on snapshot1006 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [15:20:55] (03PS1) 10ArielGlenn: use php7.0 for all dumps-related things on snapshot1006 [puppet] - 10https://gerrit.wikimedia.org/r/429822 (https://phabricator.wikimedia.org/T181029) [15:21:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4168232 (10Marostegui) Thanks @Papaul - I have started MySQL to let it replicate for a couple of days before closing this. I will leave the host depool... [15:22:01] (03CR) 10Marostegui: [C: 032] db2081: Reenable notifications [puppet] - 10https://gerrit.wikimedia.org/r/429816 (https://phabricator.wikimedia.org/T193325) (owner: 10Marostegui) [15:23:40] (03PS1) 10Cmjohnson: remove site.pp & dhcpd entry graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/429823 (https://phabricator.wikimedia.org/T187190) [15:24:13] (03CR) 10ArielGlenn: [C: 032] use php7.0 for all dumps-related things on snapshot1006 [puppet] - 10https://gerrit.wikimedia.org/r/429822 (https://phabricator.wikimedia.org/T181029) (owner: 10ArielGlenn) [15:24:20] (03PS2) 10ArielGlenn: use php7.0 for all dumps-related things on snapshot1006 [puppet] - 10https://gerrit.wikimedia.org/r/429822 (https://phabricator.wikimedia.org/T181029) [15:25:08] (03PS2) 10Cmjohnson: remove site.pp & dhcpd entry graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/429823 (https://phabricator.wikimedia.org/T187190) [15:25:12] (03PS1) 10Jcrespo: mariadb: Repool db1056 and db1069 with low load after maintenance [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/429824 (https://phabricator.wikimedia.org/T186320) [15:26:29] (03CR) 10Cmjohnson: [C: 032] remove site.pp & dhcpd entry graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/429823 (https://phabricator.wikimedia.org/T187190) (owner: 10Cmjohnson) [15:27:41] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1056 and db1069 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429824 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [15:28:01] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168262 (10mobrovac) There doesn't seem to be anything wrong with the transport mechanism: the... [15:28:38] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168264 (10mobrovac) p:05Unbreak!>03High [15:29:25] (03Merged) 10jenkins-bot: mariadb: Repool db1056 and db1069 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429824 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [15:29:52] (03CR) 10jenkins-bot: mariadb: Repool db1056 and db1069 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429824 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [15:30:02] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 36.24, 32.53, 32.11 [15:30:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka2001 is OK: (C)10 ge (W)5 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2001 [15:30:17] (03PS1) 10Imarlier: Remove references to hafnium 
[puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) [15:30:28] (03PS1) 10Jcrespo: mariadb: Fully pool back db1056 and db1069 as x1 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429826 (https://phabricator.wikimedia.org/T186320) [15:31:16] _joe_: Did you see https://gerrit.wikimedia.org/r/429662 yet? [15:31:54] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 and db1069 with low load (duration: 00m 59s) [15:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:43] <_joe_> hoo: not likely, I was off most of last week and I'm catching up [15:32:54] <_joe_> and leaving again this evening until next monday [15:33:36] <_joe_> hoo: so since the dispatcher runs for an extended time, it might benefit from enabling the jit compiler [15:33:47] <_joe_> I'd try that first [15:34:08] <_joe_> but if you need to solve an emergency before I come back, please just sync with the DBAs [15:34:14] (03CR) 10Imarlier: "@Ottomata I suspect that the ZeroMQ role can be removed altogether, but I didn't want to do that in case there's knock-on effects that I d" [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [15:34:37] _joe_: How can we do that? Passing an env var? [15:34:58] <_joe_> hoo: you can pass PHP="hhvm -o ..." to enable it [15:35:19] <_joe_> I have to search the specific variable, but wait a sec, there is a task that might give you hints [15:35:30] Cool, thanks :) [15:35:51] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4168279 (10Cmjohnson) I disabled ge-1/0/1 graphite1002 for decom...i did this on both switches. The port labels were not changed. 
[15:36:19] <_joe_> hoo: https://phabricator.wikimedia.org/T191921#4150546 [15:39:36] (03PS1) 10Alexandros Kosiaris: mobileapps: Add contactgroup for mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/429827 [15:40:45] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1069 [puppet] - 10https://gerrit.wikimedia.org/r/429828 (https://phabricator.wikimedia.org/T192979) [15:40:50] (03CR) 10Dzahn: [C: 031] wmf-auto-reimage: increase timeout for Puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/429738 (owner: 10Volans) [15:41:33] (03Abandoned) 10Dzahn: phabricator: make dumps server configurable, rsync to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/428540 (https://phabricator.wikimedia.org/T188149) (owner: 10Dzahn) [15:42:59] (03PS1) 10Hoo man: Run enable HHVM's JIT for Wikidata dispatchers [puppet] - 10https://gerrit.wikimedia.org/r/429829 (https://phabricator.wikimedia.org/T193349) [15:43:42] (03PS2) 10Hoo man: Enable HHVM's JIT for Wikidata dispatchers [puppet] - 10https://gerrit.wikimedia.org/r/429829 (https://phabricator.wikimedia.org/T193349) [15:44:05] _joe_: ^ Getting this out soon would be very appreciated [15:44:20] <_joe_> hoo: give me 5 minutes [15:44:39] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168308 (10TonyBallioni) @mobrovac: the reason there is only one stuck currently is likely beca... 
[15:44:40] (03CR) 10Jcrespo: [C: 032] mariadb: Fully pool back db1056 and db1069 as x1 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429826 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [15:45:54] (03Merged) 10jenkins-bot: mariadb: Fully pool back db1056 and db1069 as x1 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429826 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [15:46:03] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 36.74, 32.36, 32.28 [15:46:06] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db1069 [puppet] - 10https://gerrit.wikimedia.org/r/429828 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:47:13] (03CR) 10Ottomata: [C: 031] Remove references to hafnium [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [15:48:03] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 35.08, 32.68, 32.36 [15:49:43] (03CR) 10jenkins-bot: mariadb: Fully pool back db1056 and db1069 as x1 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429826 (https://phabricator.wikimedia.org/T186320) (owner: 10Jcrespo) [15:49:46] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4168342 (10Joe) Dumps are already partially running on php 7 and have been thoroughly tested in the past months, so I'd leav... [15:49:54] <_joe_> can someone look at mw1233? 
[15:50:32] (03PS3) 10Giuseppe Lavagetto: Enable HHVM's JIT for Wikidata dispatchers [puppet] - 10https://gerrit.wikimedia.org/r/429829 (https://phabricator.wikimedia.org/T193349) (owner: 10Hoo man) [15:51:24] (03CR) 10Giuseppe Lavagetto: [C: 032] Enable HHVM's JIT for Wikidata dispatchers [puppet] - 10https://gerrit.wikimedia.org/r/429829 (https://phabricator.wikimedia.org/T193349) (owner: 10Hoo man) [15:52:16] <_joe_> hoo:^^ done; I'll run puppet on terbium too [15:52:28] Cool… let's see what that gives :) [15:53:03] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 35.64, 32.82, 32.34 [15:53:19] <_joe_> hoo: as an alternative, we can install the terbium replacement and run the dispatcher from there with php7 ^_^ [15:53:23] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 and db1069 with full weight (duration: 00m 59s) [15:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:37] <_joe_> hoo: puppet has run, let me know if you see any issue [15:53:50] _joe_: Heh, I'd love to move there ASAP :) [15:53:52] <_joe_> I've got to hop into a meeting now [15:54:15] (03PS4) 10Eevans: cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) [15:59:03] (03CR) 10Dzahn: "awww. really? this looks too easy :) why did i not see that. thanks! i'll compile it after next meeting" [puppet] - 10https://gerrit.wikimedia.org/r/429827 (owner: 10Alexandros Kosiaris) [15:59:07] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168356 (10Tgr) >>! In T193254#4168262, @mobrovac wrote: > There is only one stuck job now and... 
[16:00:59] 10Operations: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4168359 (10Vgutierrez) [16:01:11] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168369 (10revi) >>! In T193254#4168262, @mobrovac wrote: > > There is only one stuck job now a... [16:02:21] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168381 (10Tgr) Anyway, fixed Ajh98/Nqtema with the script. Is there a per-jobtype debug loggi... [16:02:57] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168383 (10alanajjar) >>! In T193254#4168381, @Tgr wrote: > Anyway, fixed Ajh98/Nqtema with the... 
[16:05:54] (03Abandoned) 10Hoo man: Increase dispatching resources by about 50% [puppet] - 10https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) (owner: 10Hoo man) [16:07:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few inline comments, rest LGTM" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [16:08:10] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 33.11, 32.41, 32.03 [16:10:26] (03PS2) 10Alexandros Kosiaris: mobileapps: Add contactgroup for mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/429827 (https://phabricator.wikimedia.org/T189524) [16:11:40] (03PS1) 10Ema: varnishmedia: remove python daemon [puppet] - 10https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942) [16:12:19] (03CR) 10jerkins-bot: [V: 04-1] varnishmedia: remove python daemon [puppet] - 10https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [16:13:07] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4168426 (10alanajjar) >>! In T193254#4168383, @alanajjar wrote: >>>! In T193254#4168381, @Tgr w... [16:14:33] (03CR) 10Imarlier: Make webperf role install coal things (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [16:14:39] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4168429 (10demon) a:03demon I'll handle this. Should just be a domain swap--no need to bother doing renames... 
[16:14:48] (03PS10) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [16:15:20] (03CR) 10Ema: "Is there any other dashboard except for https://grafana-admin.wikimedia.org/dashboard/db/media?refresh=5m&orgId=1&panelId=21&fullscreen&ed" [puppet] - 10https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [16:17:12] (03CR) 10Alexandros Kosiaris: [C: 032] mobileapps: Add contactgroup for mobileapps [puppet] - 10https://gerrit.wikimedia.org/r/429827 (https://phabricator.wikimedia.org/T189524) (owner: 10Alexandros Kosiaris) [16:17:18] (03CR) 10Alexandros Kosiaris: [C: 032] Add mobileapps to contacts for mobileapps LVS service [puppet] - 10https://gerrit.wikimedia.org/r/425991 (https://phabricator.wikimedia.org/T189524) (owner: 10Alexandros Kosiaris) [16:17:22] (03PS3) 10Alexandros Kosiaris: Add mobileapps to contacts for mobileapps LVS service [puppet] - 10https://gerrit.wikimedia.org/r/425991 (https://phabricator.wikimedia.org/T189524) [16:18:30] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168435 (10Tgr) [16:19:28] 10Operations, 10GlobalRename, 10JobRunner-Service, 10MediaWiki-JobQueue, and 3 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4164229 (10Tgr) [16:20:29] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 4 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168440 (10Tgr) [16:21:51] (03PS1) 10Dzahn: Revert "icinga: add contactgroup for mobileapps to Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/429834 [16:21:56] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 4 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168443 (10Pchelolo) > 
The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabeti... [16:22:17] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/429827/ says "don't use Hiera for this", so revert my attempt?" [puppet] - 10https://gerrit.wikimedia.org/r/429834 (owner: 10Dzahn) [16:23:04] (03PS11) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [16:23:37] (03PS2) 10Dzahn: Revert "icinga: add contactgroup for mobileapps to Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/429834 [16:27:54] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168446 (10mobrovac) >>! In T193254#4168443, @Pchelolo wrote: > I believe this is the case. When switching the job to Kafka it was done only for test wikis a... [16:28:07] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168448 (10mobrovac) [16:32:28] (03PS1) 10Ppchelko: Switch LocalRenameUserJob to kafka. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429836 (https://phabricator.wikimedia.org/T193254) [16:33:01] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 6 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168457 (10fdans) p:05High>03Triage [16:34:24] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 6 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168466 (10Milimetric) p:05Triage>03High sorry - reverting accidental change of priority [16:34:53] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4168470 (10ema) @Krinkle I've pushed https://gerrit.wikimedia.org/r/429833 to remove varnishmedia, my understanding is that there's only [[ https://grafa... [16:35:38] (03CR) 10Ema: [C: 032] Add n_lru_limited counter [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429394 (owner: 10Ema) [16:35:46] (03CR) 10Ema: [C: 032] Add cache_hit_grace counter [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429395 (https://phabricator.wikimedia.org/T192368) (owner: 10Ema) [16:35:55] (03CR) 10Ema: [C: 032] Ignore req.ttl when keeping track of expired objects [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429762 (owner: 10Ema) [16:36:04] (03CR) 10Ema: [C: 032] Introduce ttl_now and the new way of calculating TTLs in VCL [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429440 (owner: 10Ema) [16:37:08] ema is on a merge spree! (read with unreal tournament voice) [16:38:00] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4168473 (10akosiaris) I think the 2 patches above have fixed the both issues. mobileapps team will get notifications for all ser... 
[16:38:23] vgutierrez: thanks for the good memory (unreal tournament) [16:39:58] tgr: ping? [16:42:40] heh [16:43:48] (03CR) 10Mobrovac: [C: 032] Switch LocalRenameUserJob to kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429836 (https://phabricator.wikimedia.org/T193254) (owner: 10Ppchelko) [16:44:00] * mobrovac taking over tin for 10 mins [16:45:14] (03Merged) 10jenkins-bot: Switch LocalRenameUserJob to kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429836 (https://phabricator.wikimedia.org/T193254) (owner: 10Ppchelko) [16:46:56] (03CR) 10Dzahn: [C: 032] Revert "icinga: add contactgroup for mobileapps to Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/429834 (owner: 10Dzahn) [16:47:15] akosiaris: ^ I'm reverting mine that used Hiera [16:47:30] mutante: ok cool. thanks! [16:47:34] and thanks! [16:47:56] (03PS1) 10Ema: 5.1.3-1wm8: add patches included in 4.1.10 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429839 (https://phabricator.wikimedia.org/T192368) [16:49:38] (03CR) 10jenkins-bot: Switch LocalRenameUserJob to kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429836 (https://phabricator.wikimedia.org/T193254) (owner: 10Ppchelko) [16:49:39] !log ppchelko@tin Started deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254 [16:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:45] T193254: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254 [16:50:28] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. 
T193254 (duration: 00m 49s) [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:32] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch LocalRenameUserJob to EventBus for all wikis - T193254 T190327 (duration: 00m 59s) [16:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:37] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [16:51:29] (03PS9) 10Dzahn: icinga: add notification type to SMS content and other improvements [puppet] - 10https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) [16:51:45] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4168570 (10Dzahn) I confirmed that on einsteinium, the services have the right contact groups now. Reverted my own change that a... [16:51:50] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4168571 (10Dzahn) 05Open>03Resolved [16:51:57] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4168574 (10RobH) 05stalled>03declined a:05Lucas_Werkmeister_WMDE>03None Since this has been stalled for nearly two weeks, I'm going to go ahead and close it as decl... [16:53:14] 10Operations, 10Ops-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4168582 (10RobH) a:03Deskana @Deskana: I'm going to assign this to you directly, as it appears to be awaiting your feedback. Once you have given it, and it is ready for SRE re... [16:53:33] robh: you're right. better to not keep it stalled for weeks. 
reopening is easy [16:53:46] and it had that subtask [16:54:39] (03CR) 10Dzahn: [C: 032] "thanks for reviews. mentioned in ops meeting as well. now using the new commands for all" [puppet] - 10https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) (owner: 10Dzahn) [16:57:40] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168614 (10mobrovac) 05Open>03Resolved a:03mobrovac We have switched the LocalRenameUserJob for all wikis to EventBus, so we don't anticipate any probl... [16:58:38] (03PS1) 10Cmjohnson: Removing dns entries graphite1002 [dns] - 10https://gerrit.wikimedia.org/r/429840 (https://phabricator.wikimedia.org/T187190) [16:59:04] (03CR) 10Cmjohnson: [C: 032] Removing dns entries graphite1002 [dns] - 10https://gerrit.wikimedia.org/r/429840 (https://phabricator.wikimedia.org/T187190) (owner: 10Cmjohnson) [16:59:15] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4168627 (10RobH) [16:59:40] yeah seemed better to get it off board [16:59:48] so no visual fatigue and no one overlooks it if reopened [17:00:00] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4168632 (10Cmjohnson) [17:00:04] gehel: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T1700). 
[17:03:25] jouncebot: o/ [17:03:42] (03PS12) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [17:04:15] (03PS1) 10Andrew Bogott: labtestwikitech: use eqiad db host (db1073) even from codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) [17:05:19] (03PS1) 10Dzahn: icinga: remove test-rob from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/429842 (https://phabricator.wikimedia.org/T185862) [17:05:30] (03CR) 10jerkins-bot: [V: 04-1] labtestwikitech: use eqiad db host (db1073) even from codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [17:05:40] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#4168673 (10Volans) As discussed in the monitoring meeting here some feedback: - while the limit on the number of rows/panels/metrics is understandable, it could make hard... [17:06:12] (03PS2) 10Dzahn: icinga: remove test-rob from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/429842 (https://phabricator.wikimedia.org/T185862) [17:06:19] (03PS3) 10Dzahn: icinga: remove test-robh from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/429842 (https://phabricator.wikimedia.org/T185862) [17:07:13] (03PS2) 10Andrew Bogott: labtestwikitech: use eqiad db host (db1073) even from codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) [17:08:37] (03CR) 10Dzahn: [C: 032] icinga: remove test-robh from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/429842 (https://phabricator.wikimedia.org/T185862) (owner: 10Dzahn) [17:09:16] (03CR) 10Jcrespo: "Let me think about this- what kind of data is handled here? Is there any private data on labstestwiki at all (users, ips)? 
If yes, it woul" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [17:09:31] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [17:09:46] awww. looking ^ [17:10:43] !log removing stale scap log for wdqs on tin.eqiad.wmnet [17:10:43] (03CR) 10Andrew Bogott: "labtestwiki does have logins and passwords but they're separate accounts from those used on wikitech. The wiki is only used by staff and " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: 10Andrew Bogott) [17:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:03] !log gehel@tin Started deploy [wdqs/wdqs@2579bfa]: deploying wdqs gui [17:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:23] 10Operations, 10Mail: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4168707 (10herron) [17:15:20] !log gehel@tin Finished deploy [wdqs/wdqs@2579bfa]: deploying wdqs gui (duration: 04m 16s) [17:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:17] SMalyshev: deploy completed, tests are green [17:18:29] icinga config issue fixed [17:18:43] (03PS1) 10Gilles: Reafactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 [17:19:33] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [17:20:16] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: icinga ACK shows as CRIT when delivered via SMS - https://phabricator.wikimedia.org/T185862#4168723 (10Dzahn) 05Open>03Resolved - new commands were tested for a while - all contacts are now using the new notification command definitions - test u... 
[17:20:57] (03CR) 10Zhuyifei1999: [C: 031] ruby: install libmysqlclient-dev package in the base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/429758 (https://phabricator.wikimedia.org/T192566) (owner: 10Arturo Borrero Gonzalez) [17:21:16] gehel: thank you! [17:22:35] (03CR) 10Andrew Bogott: [C: 031] ruby: install libmysqlclient-dev package in the base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/429758 (https://phabricator.wikimedia.org/T192566) (owner: 10Arturo Borrero Gonzalez) [17:22:52] (03CR) 10Smalyshev: "Not sure we need it for Updater - the only really multithreaded part there is Wikibase download, and it shouldn't consume a lot of memory." [puppet] - 10https://gerrit.wikimedia.org/r/429552 (https://phabricator.wikimedia.org/T193365) (owner: 10Gehel) [17:22:59] (03PS2) 10Gilles: Reafactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 [17:23:11] (03CR) 10Smalyshev: [C: 031] wdqs: enable UseNUMA on blazegraph and updater [puppet] - 10https://gerrit.wikimedia.org/r/429552 (https://phabricator.wikimedia.org/T193365) (owner: 10Gehel) [17:24:22] 10Operations, 10Mail: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4168359 (10Reedy) I don't think other domains are used for sending emails. And I'm guessing wikipedia.org is probably only mostly used with OTRS. Seems worthwhile doing as a hardening measure, for sure. Are any... [17:24:23] (03PS4) 10Gehel: wdqs: enable UseNUMA on blazegraph and updater [puppet] - 10https://gerrit.wikimedia.org/r/429552 (https://phabricator.wikimedia.org/T193365) [17:24:48] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4168747 (10cwdent) SSL certs are what allow your browser to show you a green bar and guarantee that if you see that, you are talking to the Wikimedia Fo... 
[17:25:31] (03CR) 10Gehel: [C: 032] wdqs: enable UseNUMA on blazegraph and updater [puppet] - 10https://gerrit.wikimedia.org/r/429552 (https://phabricator.wikimedia.org/T193365) (owner: 10Gehel) [17:26:39] !log restart blazegraph and updater on wdqs1003 to activate UseNUMA -T193365 [17:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:44] T193365: Evaluate using NUMA for Blazegraph - https://phabricator.wikimedia.org/T193365 [17:27:07] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4168760 (10Ejegg) cwdent we formerly had silverpop-hosted urls in the email links, and lots of people thought they were phishing spam [17:27:35] (03CR) 10Imarlier: "Puppet compiler run looks right to me: https://puppet-compiler.wmflabs.org/compiler02/11078/" [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:28:53] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4168775 (10CCogdill_WMF) We used a Silverpop URL for a few months and got enough complaints from donors that our Donor Services team asked us to turn cl... [17:29:55] @system_administrators Is it possible to recover an account if it has a Yahoo or Hotmail email address? [17:31:31] (03CR) 10Dzahn: "this is separate from this change, but i would like it if we can start breaking up that line into one line per virtual host. 
it would make" [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [17:31:40] (03PS4) 10Dzahn: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [17:34:07] (03CR) 10Dzahn: [C: 032] idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [17:37:15] (03PS3) 10ArielGlenn: generate checksums on a per job basis, updating the hash as needed [dumps] - 10https://gerrit.wikimedia.org/r/429245 [17:38:01] (03CR) 10ArielGlenn: [C: 032] generate checksums on a per job basis, updating the hash as needed [dumps] - 10https://gerrit.wikimedia.org/r/429245 (owner: 10ArielGlenn) [17:38:46] (03CR) 10Gilles: [C: 031] Remove references to hafnium [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:39:06] !log awight@tin Started deploy [ores/deploy@8c586ab]: Canary-only test deployment for ORES + git-lfs, T181678 [17:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:10] T181678: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678 [17:39:46] !log ariel@tin Started deploy [dumps/dumps@8398f53]: write checksums of dump files into seperate hashfiles, reusing their contents as appropriate [17:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:49] !log ariel@tin Finished deploy [dumps/dumps@8398f53]: write checksums of dump files into seperate hashfiles, reusing their contents as appropriate (duration: 00m 03s) [17:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:05] !log awight@tin Finished deploy [ores/deploy@8c586ab]: Canary-only test deployment for ORES + git-lfs, T181678 (duration: 01m 59s) [17:41:08] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T1800). [18:00:04] RoanKattouw and Smalyshev: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:14] here [18:00:19] Hello [18:00:22] I can do the SWAT [18:00:35] great [18:01:36] RoanKattouw: Can I add a patch to SWAT? [18:01:44] Sure! [18:02:07] (03CR) 10Catrope: [C: 032] Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [18:03:32] (03Merged) 10jenkins-bot: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [18:04:26] RoanKattouw: Done. It's on the calendar. Thanks. 
[18:04:58] !log awight@tin Started deploy [ores/deploy@46824bb]: Canary-only test deployment for ORES + git-lfs, T181678 (take 2) [18:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:02] T181678: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678 [18:06:55] !log awight@tin Finished deploy [ores/deploy@46824bb]: Canary-only test deployment for ORES + git-lfs, T181678 (take 2) (duration: 01m 58s) [18:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:20] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4150980 (10RobH) @Urbanecm: Please note we'll also need you to review and agree/sign the L3 document on phabricator for... [18:07:50] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4168937 (10RobH) [18:08:20] SMalyshev: On mwdebug1002 now [18:08:39] (03CR) 10Bstorm: "I upgraded the requirements for running the script (in previous commits, I locked it down more recently) so that the configuration is only" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [18:08:50] RoanKattouw: testing [18:10:09] (03PS2) 10Imarlier: Remove references to hafnium [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) [18:10:36] RoanKattouw: hmm something seems to be not working well, please leave it on 1002 but don't deploy further [18:11:03] OK [18:12:33] Niharika: Yours is on mwdebug1002, please test [18:12:34] (03CR) 10Imarlier: "Puppet compiler run: https://puppet-compiler.wmflabs.org/compiler02/11080/" [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) 
(owner: 10Imarlier) [18:12:45] RoanKattouw: looks like there's some problems with internal cluster (bad luck...) so let's revert it for now until we debug what is going on [18:12:49] On it. [18:13:01] OK, reverting [18:13:20] (03PS1) 10Catrope: Revert "Set SPARQL services to use internal cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429851 [18:13:25] (03CR) 10Catrope: [C: 032] Revert "Set SPARQL services to use internal cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429851 (owner: 10Catrope) [18:13:51] RoanKattouw: Works as expected. You can sync. [18:14:55] (03Merged) 10jenkins-bot: Revert "Set SPARQL services to use internal cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429851 (owner: 10Catrope) [18:15:27] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4168948 (10awight) Pilot deployment to the canary server failed, with no errors: {P7055} Looks good, but the large file was never ch... 
[18:15:48] (03CR) 10jenkins-bot: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [18:15:53] (03CR) 10jenkins-bot: Revert "Set SPARQL services to use internal cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429851 (owner: 10Catrope) [18:16:09] !log starting rolling reimage of kafka main-eqiad brokers kafka100[123] - T192832 [18:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:13] T192832: Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832 [18:16:15] (03CR) 10Krinkle: "I forget the name and syntax, but there's a spare/standard role or some such SRE typically leaves here when clearing an site.pp entry, whi" [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [18:17:39] !log catrope@tin Synchronized php-1.32.0-wmf.1/extensions/CodeMirror/resources/modules/ve-cm/ve.ui.CodeMirrorAction.js: T191923 (duration: 01m 00s) [18:17:40] (03PS3) 10Catrope: Enable mapframe on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584) [18:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:43] T191923: Enable CodeMirror in 2017 Wikitext editor in all wikis without the beta feature - https://phabricator.wikimedia.org/T191923 [18:17:45] (03CR) 10Catrope: [C: 032] Enable mapframe on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584) (owner: 10Catrope) [18:19:12] (03Merged) 10jenkins-bot: Enable mapframe on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584) (owner: 10Catrope) [18:19:53] PROBLEM - Check that eventlogging-service-eventbus is running on kafka1001 is CRITICAL: PROCS CRITICAL: 0 
processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [18:20:15] 10Operations, 10Beta-Cluster-Infrastructure, 10DBA: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115#1569333 (10EddieGP) Just judging from the task title, this and {T183245} look like being duplicates? [18:20:43] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4168958 (10awight) Maybe we need `git lfs pull --recursive`? No clue why this would have worked on beta without the `--recursive`, ho... [18:21:03] PROBLEM - Check systemd state on kafka1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:21:39] PROBLEM - Kafka Broker Server on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [18:21:54] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4168966 (10EddieGP) >>! In T176370#4168342, @Joe wrote: > I guess we do test the maintenance scripts in beta? We don't: {T1... [18:21:56] whaaa [18:21:58] i downtimed it... [18:22:42] ottomata: all under control? :) [18:22:49] yeah [18:22:49] Quick question -- we're preparing a host (hafnium) to be decommissioned. What role (if any) should I give it in the puppet site.pp, given that it's not yet offline? 
[18:22:52] Changeset is here: https://gerrit.wikimedia.org/r/#/c/429825/2/manifests/site.pp [18:22:53] i downtimed that very explicitly [18:22:57] no idea why it paged [18:22:58] !log awight@tin Started deploy [ores/deploy@5b27205]: Rollback ORES canary to master [18:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:01] it shows downtimed in icinga [18:23:02] (03PS1) 10Smalyshev: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429853 (https://phabricator.wikimedia.org/T192942) [18:23:08] marlier: role(spare::system) [18:23:19] !log awight@tin Finished deploy [ores/deploy@5b27205]: Rollback ORES canary to master (duration: 00m 21s) [18:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:20] marlier: see https://wikitech.wikimedia.org/wiki/Server_Lifecycle [18:24:43] (03CR) 10jenkins-bot: Enable mapframe on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584) (owner: 10Catrope) [18:24:59] marlier: here is the checklist template that can be copy/pasted into a decom ticket https://wikitech.wikimedia.org/wiki/Server_Lifecycle/reclaim_checklist the first 5 checkboxes end with "[] - remove site.pp (replace with role::spare::system if system isn't shut down immediately during this process.) 
" [18:25:22] marlier: then after that step they are handed-over to dc-ops to continue [18:27:20] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T191584 (duration: 01m 00s) [18:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:24] T191584: Release mapframe to English Wikipedia - https://phabricator.wikimedia.org/T191584 [18:28:41] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4168997 (10awight) \o/ just did a rollback of one machine in 24 seconds, t... [18:30:02] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169008 (10awight) p:05Triage>03High [18:30:19] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169021 (10awight) p:05High>03Normal [18:32:05] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 321 MB (3% inode=74%) [18:32:12] (03PS1) 10Cmjohnson: Removing dns for mobile1004/1005 [dns] - 10https://gerrit.wikimedia.org/r/429855 (https://phabricator.wikimedia.org/T181750) [18:32:39] (03PS3) 10Imarlier: Remove references to hafnium [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) [18:32:45] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169027 (10alanajjar) @mobrovac I think it still the same! see [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Menageross03|here]], the proce... 
[18:32:47] (03CR) 10Cmjohnson: [C: 032] Removing dns for mobile1004/1005 [dns] - 10https://gerrit.wikimedia.org/r/429855 (https://phabricator.wikimedia.org/T181750) (owner: 10Cmjohnson) [18:32:57] (03CR) 10Imarlier: "> I forget the name and syntax, but there's a spare/standard role or" [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [18:33:14] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169038 (10alanajjar) 05Resolved>03Open [18:37:18] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169044 (10alanajjar) 05Open>03Resolved [18:37:56] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4164240 (10alanajjar) Thanks a lot all [18:38:53] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169059 (10Pchelolo) > As you know, we can't say it resolved until we being sure, because there's many pending requests, so if we said to all global renamers... [18:39:22] mutante, volans: Thanks a lot. Created the decom ticket: https://phabricator.wikimedia.org/T193420. There are a number of steps that I don't have permissions for (not allowed to downtime in icinga, for example), so I need to coordinate with someone on this. What's the right way to go about that? Add Ops as a tag and wait until triage gets to it? 
[18:39:58] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169060 (10alanajjar) Yes @Pchelolo I noticed that now, Thanks again [18:40:15] marlier: i can take that [18:40:29] marlier: I would say add operations and if urgent ping the clinic duty person [18:40:37] marlier: the general answer, yes, add the tag Operations and hardware-requests [18:40:52] think of them as reverse hw-requests [18:41:34] 10Operations, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169064 (10Imarlier) [18:41:50] Works for me, thanks! [18:47:28] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4169093 (10cwdent) @Ejegg @CCogdill_WMF ok scratch that idea :) [18:51:48] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4169095 (10awight) Apparently I'm bad at git, and I failed to commit the right submodule pointers... trying again. [18:53:26] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169101 (10Dzahn) a:05awight>03Dzahn [18:53:52] 10Operations, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169102 (10Dzahn) a:05Imarlier>03Dzahn [18:54:16] 10Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#4169103 (10bbogaert) Hi @herron , Thanks for sending this along. This is a great help. I'm going to read over this, and see if I can make some sub OU's and see if mail still flows. Also, @Mor... 
[18:54:22] marlier: can you confirm the " [18:54:22] - all system services confirmed offline from production use [18:54:43] i'll shut them down then [18:55:03] 10Operations, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169104 (10Imarlier) [18:55:04] Yep, all set [18:55:25] I just can't actually stop them, since puppet'll just turn them back on. [18:55:38] But they're running fine on webperf1001 [18:55:56] heh, i see, ok cool [18:56:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decommission mobile 1004 and mobile1005 - https://phabricator.wikimedia.org/T181750#4169107 (10Cmjohnson) [18:57:03] [einsteinium:~] $ sudo icinga-downtime -h hafnium -r "T193420" [18:57:03] T193420: Decommission hafnium - https://phabricator.wikimedia.org/T193420 [18:57:07] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4169108 (10Cmjohnson) Disabled mobile1004 from b4 4/0/16 on both switches.....description remains mobile1004 until removed from rack. [18:57:36] 10Operations, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169110 (10Dzahn) [18:58:39] (03PS4) 10Dzahn: Remove references to hafnium [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [18:58:52] marlier: downtimed in icinga, merging your change [18:59:03] (03CR) 10Dzahn: [C: 032] Remove references to hafnium [puppet] - 10https://gerrit.wikimedia.org/r/429825 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [18:59:06] Brilliant, thank you! 
[18:59:40] jouncebot: next [18:59:40] In 1 hour(s) and 0 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T2000) [19:00:03] checked that because we are removing it from scap group [19:01:44] !log hafnium - sudo service navtiming stop; sudo service statsv stop - downtimed in icinga, decom [19:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:44] 10Operations, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169115 (10Dzahn) 14:57 < mutante> [einsteinium:~] $ sudo icinga-downtime -h hafnium -r "T193420" 14:58 < mutante> marlier: downtimed in icinga, merging your change 15:01 < mutante> !log... [19:02:56] 10Operations, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169116 (10Dzahn) [19:06:49] (03PS1) 10Dzahn: DHCP/partman: remove hafnium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/429858 (https://phabricator.wikimedia.org/T193420) [19:08:30] !log awight@tin Started deploy [ores/deploy@25579e7]: Trial LFS deployment to ORES canary; T181678 [19:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:34] T181678: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678 [19:10:36] !log awight@tin Finished deploy [ores/deploy@25579e7]: Trial LFS deployment to ORES canary; T181678 (duration: 02m 06s) [19:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:23] (03CR) 10Dzahn: [C: 032] DHCP/partman: remove hafnium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/429858 (https://phabricator.wikimedia.org/T193420) (owner: 10Dzahn) [19:14:34] 10Operations, 10ops-eqiad, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169133 (10Dzahn) [19:14:52] 10Operations, 
10ops-eqiad, 10Performance-Team, 10hardware-requests: Decommission hafnium - https://phabricator.wikimedia.org/T193420#4168982 (10Dzahn) a:05Dzahn>03None [19:14:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decommission mobile 1004 and mobile1005 - https://phabricator.wikimedia.org/T181750#4169136 (10Cmjohnson) still needs port descriptions updated. [19:15:20] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169138 (10Imarlier) [19:15:46] 10Operations, 10ops-eqiad, 10hardware-requests, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420#4169143 (10Dzahn) From here: @Cmjohnson you can continue on the ticket [19:17:31] marlier: alright, so you did your part and i did mine and from here it will move on https://phabricator.wikimedia.org/tag/ops-eqiad/ [19:17:43] Rockin' [19:17:52] mutante: as always, many thanks! [19:18:29] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049047 (10Imarlier) [19:18:31] i still see an alert for the statsv service but that should be gone after next puppet run , checking [19:18:34] you're welcome [19:25:20] 10Operations, 10Mail: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4169202 (10herron) While we're at it there are many other domains in our control (e.g. the .tld variants of the canonical domains) that we can adjust SPF for as well. Since this involves reviewing the intended... 
[19:37:11] (03PS1) 10Dzahn: mediawiki/apache: seperate line for each chapter ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/429863 [19:39:54] (03PS2) 10Dzahn: mediawiki/apache: seperate line for each chapter ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/429863 [19:46:49] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4169339 (10Imarlier) a:03Imarlier [19:49:25] 10Operations: http://noboard.chapters.wikimedia.org/ ? - https://phabricator.wikimedia.org/T82116#4169363 (10Dzahn) [19:52:01] (03PS1) 10Ottomata: Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) [19:53:13] (03PS2) 10Ottomata: Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) [19:55:59] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169396 (10Dzahn) @awight I think the correct path is /srv/deployment/ores/venv (vs. /srv/... 
[19:57:12] (03PS3) 10Ottomata: Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) [19:57:39] (03CR) 10jerkins-bot: [V: 04-1] Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [19:57:52] 10Operations, 10MediaWiki-Platform-Team, 10Epic, 10Performance-Team (Radar), 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#4169407 (10aaron) [20:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T2000). [20:03:55] (03CR) 1020after4: [C: 04-1] WIP: phabricator refactor init.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [20:05:09] (03PS2) 1020after4: WIP: phabricator refactor init.pp [puppet] - 10https://gerrit.wikimedia.org/r/324808 [20:05:20] !log ppchelko@tin Started deploy [changeprop/deploy@8cd45ed]: Don't filter bots from the ORES stream T187927 [20:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:25] T187927: Drop "non bot" condition from ORES changeprop rules - https://phabricator.wikimedia.org/T187927 [20:06:04] (03CR) 10jerkins-bot: [V: 04-1] WIP: phabricator refactor init.pp [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [20:06:30] (03CR) 10Paladox: "Needs to be updated to support stretch + php7.2" [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [20:06:32] (03PS3) 1020after4: WIP: phabricator refactor init.pp [puppet] - 10https://gerrit.wikimedia.org/r/324808 [20:06:35] !log ppchelko@tin Finished deploy [changeprop/deploy@8cd45ed]: Don't filter bots from 
the ORES stream T187927 (duration: 01m 15s) [20:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:59] !log awight@tin Started deploy [ores/deploy@4601497]: Trial LFS deployment to ORES canary; T181678 (take 2) [20:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:04] T181678: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678 [20:07:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: phabricator refactor init.pp [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [20:09:09] !log awight@tin Finished deploy [ores/deploy@4601497]: Trial LFS deployment to ORES canary; T181678 (take 2) (duration: 02m 10s) [20:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:19] !log arlolra@tin Started deploy [parsoid/deploy@d8d7b42]: Updating Parsoid to 50b0588 [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:27] (03PS1) 10Dzahn: w.wiki: add SPF record, disallow email [dns] - 10https://gerrit.wikimedia.org/r/429871 (https://phabricator.wikimedia.org/T193408) [20:11:35] !log bsitzmann@tin Started deploy [mobileapps/deploy@d3724d2]: Update mobileapps to cc00cae (T191869) [20:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:39] T191869: Update mobile-sections and summary to source Wikidata descriptions from local wiki where available - https://phabricator.wikimedia.org/T191869 [20:14:13] (03PS2) 10Dzahn: w.wiki: add SPF record, disallow email [dns] - 10https://gerrit.wikimedia.org/r/429871 (https://phabricator.wikimedia.org/T193408) [20:17:29] !log awight@tin Started deploy [ores/deploy@5b27205]: Rollback ores1001 to master [20:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:07] !log bsitzmann@tin Finished deploy [mobileapps/deploy@d3724d2]: Update mobileapps to cc00cae (T191869) (duration: 07m 32s) [20:19:10] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:10] T191869: Update mobile-sections and summary to source Wikidata descriptions from local wiki where available - https://phabricator.wikimedia.org/T191869 [20:19:29] (03PS1) 10Dzahn: add SPF record to disallow email for all parked domains [dns] - 10https://gerrit.wikimedia.org/r/429874 (https://phabricator.wikimedia.org/T193408) [20:20:25] !log awight@tin Finished deploy [ores/deploy@5b27205]: Rollback ores1001 to master (duration: 02m 56s) [20:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:45] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 28 connections established with conf2001.codfw.wmnet:2379 (min=29) [20:21:05] !log arlolra@tin Finished deploy [parsoid/deploy@d8d7b42]: Updating Parsoid to 50b0588 (duration: 09m 46s) [20:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:17] !log awight@tin Started deploy [ores/deploy@bf182e2]: ORES: Include bot edits in precaching wikidata itemquality; T187927 [20:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:21] T187927: Drop "non bot" condition from ORES changeprop rules - https://phabricator.wikimedia.org/T187927 [20:25:16] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4168359 (10Dzahn) >>! In T193408#4169202, @herron wrote: > .. there are many other domains in our control (e.g. the .tld variants of the canonical domains) that we can adjust SPF for as well... 
[20:25:45] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 29 connections established with conf2001.codfw.wmnet:2379 (min=29) [20:27:25] !log Updated Parsoid to 50b0588 (T186358, T191700, T192909) [20:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:32] T191700: VE: Avoid piping links for formatting where possible - https://phabricator.wikimedia.org/T191700 [20:27:32] T186358: [BUG] Some content is not rendering correctly in the Quick box / Info box on Android web view - https://phabricator.wikimedia.org/T186358 [20:27:33] T192909: Dirty diff and other weird corruption in table edit - https://phabricator.wikimedia.org/T192909 [20:29:36] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4169563 (10Dzahn) Maybe fr-tech should be added to this ticket given that FR does email campaigns using external providers, like silverpop. [20:38:17] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4169582 (10awight) Gave it another try, with commit 4601497c4f4363fcea639dc3e13d4f178c421a1b, and got strange results. The LFS data s... [20:43:25] (03PS1) 10Ppchelko: Switch all jobs for everything except wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) [20:44:09] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169598 (10awight) @Dzahn Thanks, I think you're right on both counts. We must have deploye... 
[20:49:07] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169633 (10awight) [21:00:04] bawolff and Reedy: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T2100). [21:02:37] (03PS1) 10Catrope: Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) [21:03:31] (03CR) 10Urbanecm: [C: 04-1] "See inline comments." (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [21:04:45] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225) (owner: 10MarcoAurelio) [21:17:44] PROBLEM - nova-compute proc minimum on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [21:18:14] PROBLEM - nova-compute proc maximum on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [21:18:17] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4169688 (10EddieGP) a:03Joe Assigning to joe - it seems you're the one most comfortable (or only one comfortable?) on apache changes. Also p... [21:21:41] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169694 (10MarcoAurelio) Do we need to migrate `CentralAuthRename` too? If so, can it be done? Thanks. 
[21:22:19] 10Operations, 10Gerrit, 10ORES, 10RelEng-Archive-FY201718-Q2, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#4169696 (10awight) [21:24:10] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169698 (10Pchelolo) > Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks. Eventually everything will be migrated. Are you seeing pr... [21:27:09] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169701 (10Tgr) That's a log channel, not a job queue. Other potentially affected jobs are LocalUserMergeJob (not sure if Wikimedia wikis still allow merges)... [21:28:16] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169704 (10Tgr) >>! In T193254#4169701, @Tgr wrote: > LocalPageMoveJob (I think that's triggered differently, not quite sure though). Yes it is. So LocalUse... [21:31:55] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 51.49, 36.46, 31.03 [21:32:00] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169715 (10MarcoAurelio) We are not performing any user account merges nor globally nor locally. Regards. [21:36:18] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169720 (10Tgr) Other instances of cross-wiki job scheduling that are yielded by a quick `ack 'JobQueueGroup::singleton\( '`: Cognate/LocalJobSubmitJob, Mass... 
[21:46:04] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169741 (10Pchelolo) > Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob,... [21:46:05] !log T192972 increase eqiad elasticsearch disk watermarks from 75/80 to 85/85 [21:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:09] T192972: Evaluate impact of adding ~2700 new shards to production cluster - https://phabricator.wikimedia.org/T192972 [21:46:58] (03CR) 10MarcoAurelio: idwikimedia: initial configuration (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [21:48:56] (03PS4) 10MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) [21:49:04] (03PS5) 10MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) [21:49:42] (03PS2) 10MarcoAurelio: euwikisource: add Author namespace, add English alias as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225) [21:51:04] (03CR) 10MarcoAurelio: "Visually looks better, but I lack knowledge to vote in this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/429863 (owner: 10Dzahn) [22:00:35] (03CR) 10Krinkle: "Find out :) - https://gist.github.com/Krinkle/b5ceff5156c1f4cf3568e373cc135bad" [puppet] - 10https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942) (owner: 10Ema) [22:01:00] (03Draft1) 10MarcoAurelio: cawiki: remove gendered namespace aliases, already on MW core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429989 [22:01:03] (03PS2) 10MarcoAurelio: cawiki: remove gendered namespace aliases, already on MW core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429989 (https://phabricator.wikimedia.org/T113616) [22:03:22] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4169796 (10Urbanecm) I signed L3. There are some upcoming issues with the NDA, I hope they'll be resolved soon. [22:05:10] (03CR) 10Volans: [C: 04-1] "I think there are few things to improve, see comments inline." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (owner: 10Gilles) [22:06:32] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169799 (10MarcoAurelio) Is this related to T192604 anyhow? Regards. [22:12:29] (03CR) 10Volans: "Thanks for the review, I'll fix it on Wed." (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [22:20:24] (03CR) 10Krinkle: "Is the wwwportals.conf portion of this patch testable in Beta? E.g. to confirm the [a-z] rules work there?" 
[puppet] - 10https://gerrit.wikimedia.org/r/398396 (owner: 10EddieGP) [22:24:12] (03CR) 10EddieGP: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/398396 (owner: 10EddieGP) [22:45:34] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 36.41, 31.48, 32.05 [22:52:43] looking [22:57:44] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4169881 (10Cmjohnson) drained flea power, updated bios and idrac f/w to and powered back on BIOS Version 2.7.1 Firmware Version 2.52.52.52 [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180430T2300). [23:00:04] James_F, Smalyshev, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:25] I'll do the SWAT [23:00:26] * James_F waves. [23:00:29] Thanks, RoanKattouw . [23:00:44] RoanKattouw: My two config changes are/should be no-ops. 
[23:02:19] (03PS1) 10Ottomata: Blacklisting change-prop and job topics from main -> analytics Mirror [puppet] - 10https://gerrit.wikimedia.org/r/430006 (https://phabricator.wikimedia.org/T189464) [23:03:18] (03CR) 10Ottomata: [C: 032] Blacklisting change-prop and job topics from main -> analytics Mirror [puppet] - 10https://gerrit.wikimedia.org/r/430006 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [23:04:04] (03CR) 10Sbisson: [C: 031] Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) (owner: 10Catrope) [23:05:10] (03PS2) 10Catrope: Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) (owner: 10Jforrester) [23:05:13] (03CR) 10Catrope: [C: 032] Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) (owner: 10Jforrester) [23:05:26] (03PS2) 10Catrope: Don't try to set wgSiteSupportPage, ignored for a decade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428365 (https://phabricator.wikimedia.org/T192467) (owner: 10Jforrester) [23:05:27] !log ores1001: rm -rf /srv/deployment/ores/venv/ (T193422) [23:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:34] (03CR) 10Catrope: [C: 032] Don't try to set wgSiteSupportPage, ignored for a decade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428365 (https://phabricator.wikimedia.org/T192467) (owner: 10Jforrester) [23:05:34] T193422: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422 [23:07:19] (03Merged) 10jenkins-bot: Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) 
(owner: 10Jforrester) [23:07:35] (03Merged) 10jenkins-bot: Don't try to set wgSiteSupportPage, ignored for a decade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428365 (https://phabricator.wikimedia.org/T192467) (owner: 10Jforrester) [23:08:11] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169909 (10Dzahn) @awight confirmed! it only existed on ores1001 and i deleted it. confirm... [23:08:42] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Clean up deprecated shared virtualenv directories - https://phabricator.wikimedia.org/T193422#4169912 (10Dzahn) 05Open>03Resolved [23:08:46] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4169913 (10Dzahn) [23:10:01] (03CR) 10jenkins-bot: Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) (owner: 10Jforrester) [23:12:30] RoanKattouw: At least +2 https://gerrit.wikimedia.org/r/#/c/429986/ and https://gerrit.wikimedia.org/r/#/c/430000/ so we aren't waiting forever? :-) [23:12:51] oops, sorry, missed the message [23:13:05] RoanKattouw: hopefully not too late? 
[23:13:17] No you're good [23:14:19] (03CR) 10Catrope: [C: 032] Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429853 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [23:14:53] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Config cleanup patches from SWAT (duration: 01m 00s) [23:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:36] (03Merged) 10jenkins-bot: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429853 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [23:17:30] SMalyshev: Your patch is on mwdebug1002, please test [23:17:50] testing [23:19:21] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 23.10, 22.92, 24.00 [23:19:30] RoanKattouw: seems to be working ok [23:19:40] OK, I'll sync [23:20:03] thanks [23:21:21] !log catrope@tin Synchronized wmf-config/: USe internal cluster for SPARQL services (T192942) (duration: 01m 02s) [23:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:25] T192942: Identify and migrate existing internal clients of wdqs to the new internal cluster - https://phabricator.wikimedia.org/T192942 [23:22:26] (03CR) 10Catrope: [C: 032] Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) (owner: 10Catrope) [23:22:36] (03PS2) 10Catrope: Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) [23:22:40] (03CR) 10Catrope: [C: 032] Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) (owner: 10Catrope) [23:23:37] RoanKattouw: AF is ready, VE still merging. 
[23:23:43] Thanks, pulling AF in [23:24:05] (03Merged) 10jenkins-bot: Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) (owner: 10Catrope) [23:33:59] !log catrope@tin Synchronized php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php: Fix notices when disallowing edits (duration: 00m 59s) [23:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:13] James_F: VE change on mwdebug1002, please test [23:35:48] RoanKattouw: Yeah, LGTM. [23:36:33] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Set $wgKartographerUsePageLanguage to false everywhere (T192955) (duration: 00m 59s) [23:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:37] T192955: Add config setting for default map language - https://phabricator.wikimedia.org/T192955 [23:38:02] !log catrope@tin Synchronized php-1.32.0-wmf.1/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.DiffPage.init.js: T192755 (duration: 00m 59s) [23:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:05] T192755: Simply cloning diff header disrupts user scripts and causes other issues - https://phabricator.wikimedia.org/T192755 [23:38:37] Alright, that's the SWAT done [23:42:34] yay [23:53:13] (03PS2) 10Dzahn: admins: add bitpogo and tieu to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/429460 (https://phabricator.wikimedia.org/T191523) [23:57:18] (03PS1) 10Dzahn: wmfusercontent.org: add SPF record to disable email [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408)