[00:25:35] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[00:32:35] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[00:41:02] mergification..... sweet sweet patch mergification :D
[01:47:44] marlier: it was all fine after a second puppet run, no worries
[02:01:48] (CR) Jforrester: [C: 1] Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299) (owner: Subramanya Sastry)
[02:23:07] (PS1) Dzahn: install_server: let all mw2* hosts use raid1-lvm partman [puppet] - https://gerrit.wikimedia.org/r/429370 (https://phabricator.wikimedia.org/T106381)
[02:24:11] (PS2) Dzahn: install_server: let all mw2* hosts use raid1-lvm partman [puppet] - https://gerrit.wikimedia.org/r/429370 (https://phabricator.wikimedia.org/T106381)
[02:38:02] Puppet, Beta-Cluster-Infrastructure, MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4163145 (EddieGP) Once upon a time, appservers would insert jobs into a database table and jobrunner...
[03:16:38] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:17:29] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.048 second response time
[03:17:44] well..ok.
that paged and immediately recovered
[03:33:55] Puppet, Beta-Cluster-Infrastructure, MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4163154 (aaron) refreshLinks2 is not used anymore. Since it is not in $wgJobClasses anymore, they pr...
[03:57:38] mutante: some might say that is the best type of page then, apart from never occurring in the first place
[04:05:45] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[04:19:26] (PS1) Andrew Bogott: labs_bootstrapvz: try to stabilize the fqdn [puppet] - https://gerrit.wikimedia.org/r/429374
[04:21:35] (CR) Andrew Bogott: [C: 2] labs_bootstrapvz: try to stabilize the fqdn [puppet] - https://gerrit.wikimedia.org/r/429374 (owner: Andrew Bogott)
[05:06:56] PROBLEM - Apache HTTP on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:07:55] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time
[05:18:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/429375
[05:18:30] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/429375
[05:21:41] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/429375 (owner: Marostegui)
[05:23:07] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/429375 (owner: Marostegui)
[05:23:23] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/429375 (owner: Marostegui)
[05:24:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1113 after alter table (duration: 01m 10s)
[05:24:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:27] (PS1) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/429376 (https://phabricator.wikimedia.org/T190148)
[05:29:02] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/429376 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:30:14] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/429376 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:30:38] (CR) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/429376 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:31:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105 for alter table (duration: 00m 59s)
[05:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:36] !log Deploy schema change on db1105:3312 - T191519 T188299 T190148
[05:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:42] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:31:42] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:31:42] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:44:48] (PS1) Marostegui: s5.hosts: Add db1116:3315 to s5 [software] - https://gerrit.wikimedia.org/r/429377 (https://phabricator.wikimedia.org/T190704)
[05:45:50] (CR) Marostegui: [C: 2] s5.hosts: Add db1116:3315 to s5 [software] - https://gerrit.wikimedia.org/r/429377 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:46:34] (Merged) jenkins-bot: s5.hosts: Add db1116:3315 to s5 [software] - https://gerrit.wikimedia.org/r/429377
(https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[06:12:45] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:12:55] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:14:25] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:15:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:15:56] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:16:46] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:22:26] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:23:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:29:16] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/acme_tiny.py]
[06:34:39] (CR) Elukey: [C: 1] Remove partman fallback for mediawiki hosts to single disk partman recipe [puppet] - https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) (owner: Muehlenhoff)
[06:41:16] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:41:25] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[06:57:35] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:57:45] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[06:59:15] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:23] (PS2) Muehlenhoff: Stop installing oggvideotools [puppet] - https://gerrit.wikimedia.org/r/429265
[07:03:20] (CR) Muehlenhoff: [C: 2] Stop installing oggvideotools [puppet] - https://gerrit.wikimedia.org/r/429265 (owner: Muehlenhoff)
[07:04:38] (PS2) Muehlenhoff: Remove partman fallback for mediawiki hosts to single disk partman recipe [puppet] - https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381)
[07:05:01] (Abandoned) Jcrespo: mariadb: Depool db1069 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/429153 (https://phabricator.wikimedia.org/T192979) (owner:
Jcrespo)
[07:06:03] (CR) Muehlenhoff: [C: 2] Remove partman fallback for mediawiki hosts to single disk partman recipe [puppet] - https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) (owner: Muehlenhoff)
[07:09:14] Operations, WMDE-QWERTY-Team, wikidiff2, Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4163239 (Lea_WMDE) cool, thanks for the heads up!
[07:32:58] !log swift eqiad-prod more weight to ms-be104[0-3] - T190081
[07:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:04] T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081
[07:46:35] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#4163250 (Marostegui) a:Marostegui I have backuped the tables and left them at: `/srv/backups/tmp/T57385/T57385.tar.gz` Most of the tables are e...
[07:48:42] (CR) Muehlenhoff: [C: 1] "Looks good, we could just as well switch the regex for mw[12]* to install the -lvm variant. From my point of view, it's fine that a few se" [puppet] - https://gerrit.wikimedia.org/r/429370 (https://phabricator.wikimedia.org/T106381) (owner: Dzahn)
[07:54:37] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team, MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4163266 (MarcoAurelio) @EddieGP @aaron Thanks for the information. Sti...
[08:04:49] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[08:05:39] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:07:01] (PS1) Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759)
[08:10:11] (CR) Urbanecm: [C: -1] "See inline comments." (1 comment) [dns] - https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) (owner: MarcoAurelio)
[08:11:10] (CR) Urbanecm: [C: 1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) (owner: MarcoAurelio)
[08:14:46] !log reimaging mwdebug2001 to stretch
[08:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:01] (PS2) Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759)
[08:21:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:21:59] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[08:30:29] (PS3) Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759)
[08:36:08] (PS4) Gehel: wdqs: add standard prometheus JVM monitoring to blazegraph [puppet] - https://gerrit.wikimedia.org/r/429382 (https://phabricator.wikimedia.org/T192759)
[08:44:18] (CR) Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/11051/" [puppet] - https://gerrit.wikimedia.org/r/429382
(https://phabricator.wikimedia.org/T192759) (owner: Gehel)
[08:46:19] !log installing mysql 5.5 security update (distro-packaged version) on trusty
[08:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:36] (PS3) Alexandros Kosiaris: ores: deprecate Diamond redis collector [puppet] - https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) (owner: Filippo Giunchedi)
[08:54:41] (CR) Alexandros Kosiaris: [C: 2] ores: deprecate Diamond redis collector [puppet] - https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) (owner: Filippo Giunchedi)
[08:54:53] (CR) Alexandros Kosiaris: [V: 2 C: 2] ores: deprecate Diamond redis collector [puppet] - https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) (owner: Filippo Giunchedi)
[08:56:55] (PS3) MarcoAurelio: idwikimedia: register on DNS [dns] - https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726)
[08:57:04] (PS4) MarcoAurelio: idwikimedia: register on DNS [dns] - https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726)
[08:58:49] !log reimage analytics10[51,53] to Debian Stretch
[08:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:10] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[09:05:19] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[09:05:55] (PS1) MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726)
[09:06:11] ^ akosiaris: your puppet merge probably
[09:06:35] moritzm: hm yeah I had left it at the yes/no prompt
[09:06:42] ADHD it seems
[09:06:45] sorry about that
[09:06:46] (CR) jerkins-bot: [V: -1] idwikimedia: initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: MarcoAurelio)
[09:07:02] happens to me every week, often only missing the ENTER even :-)
[09:07:19] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[09:07:39] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:09:54] (PS2) MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726)
[09:11:11] (CR) jerkins-bot: [V: -1] idwikimedia: initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: MarcoAurelio)
[09:11:12] the memcached errors are due to mwdebug2001
[09:11:23] https://logstash.wikimedia.org/app/kibana#/dashboard/memcached
[09:11:46] all nutcracker related
[09:12:00] what about that Eventbus error?
[09:12:19] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[09:12:25] mobrovac: --^
[09:12:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:12:59] (CR) MarcoAurelio: "meh, dblist sorting order again" [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: MarcoAurelio)
[09:18:06] (PS3) MarcoAurelio: idwikimedia: initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726)
[09:30:51] Operations, DBA, MediaWiki-Database: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224#4163403 (jcrespo)
[09:31:52] Operations, DBA, MediaWiki-Database: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224#4163394 (jcrespo) p:Triage>Low (Low for now, likely to get a boost for the 15-year plan)
[09:32:26] (CR) Urbanecm: [C: 1] "LGTM, thank you."
[dns] - https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) (owner: MarcoAurelio)
[09:34:04] (PS1) Jcrespo: mariadb: Convert db1118 into an s1 test hosts with MySQL 8.0 [puppet] - https://gerrit.wikimedia.org/r/429388 (https://phabricator.wikimedia.org/T193226)
[09:34:33] (PS2) Jcrespo: mariadb: Convert db1118 into an s1 test hosts with MySQL 8.0 [puppet] - https://gerrit.wikimedia.org/r/429388 (https://phabricator.wikimedia.org/T193226)
[09:39:42] (PS1) Muehlenhoff: Allow removing Diamond gradually (WIP) [puppet] - https://gerrit.wikimedia.org/r/429389
[09:40:08] (CR) jerkins-bot: [V: -1] Allow removing Diamond gradually (WIP) [puppet] - https://gerrit.wikimedia.org/r/429389 (owner: Muehlenhoff)
[09:43:27] (CR) Marostegui: "We should manually test this once it is merged, to make sure the regex being added to MySQL is correct. So many issues have been seen in t" [puppet] - https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: Bstorm)
[09:45:22] (CR) Jcrespo: [C: -1] "I am almost sure this is not ok, the _ have to be sql-escaped, both on _p and potentially on wiki names."
[puppet] - https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: Bstorm)
[09:51:26] (CR) Jcrespo: [C: -1] "example: GRANT SELECT, SHOW VIEW ON `roa\_tarawiki\_p`.* TO 'labsdbuser'" [puppet] - https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: Bstorm)
[09:53:48] !log reimaging mwdebug1001 to stretch
[09:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:42] (PS5) Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - https://gerrit.wikimedia.org/r/428390
[09:58:26] elukey: that error rate is emitted by the proxy service
[09:58:46] increased 4xx means mediawiki is sending malformed msgs, 5xx means problems with kafka conns
[09:59:41] mobrovac: sure, but I didn't see a lot of issues for that particular alarm, did you see if anything weird happened?
[09:59:52] looking
[09:59:58] looks like only 4xx, no 5xx
[10:00:01] so something with mediawiki
[10:02:23] i see timeouts from CP
[10:04:37] (PS1) Muehlenhoff: Remove mwdebug1001 from debug proxy config (being reimaged) [puppet] - https://gerrit.wikimedia.org/r/429391
[10:04:54] I think that's caused by the reimage, I forgot about those backwards compat names
[10:05:04] timeouts from the proxy service too
[10:05:06] ^ mobrovac, elukey: quick review of the patch?
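(Editor's note on the 09:45:22 and 09:51:26 review comments above: in a database-level MySQL/MariaDB GRANT, `_` and `%` in the database name act as wildcards, so a literal underscore has to be written as `\_`. The following is a minimal Python sketch of that escaping, not the code from the patch under review; the function names `escape_grant_db_name` and `grant_select_statement` are hypothetical, and only the database name, user, and statement shape are taken from the example in the log.)

```python
def escape_grant_db_name(db_name: str) -> str:
    """Backslash-escape the GRANT wildcards `_` and `%` in a database name.

    Hypothetical helper for illustration only.
    """
    return db_name.replace("%", "\\%").replace("_", "\\_")


def grant_select_statement(db_name: str, user: str) -> str:
    """Build a statement shaped like the example quoted in the review comment.

    Hypothetical helper: it does not validate or quote `user`, so it is only
    suitable for trusted, hard-coded input.
    """
    return "GRANT SELECT, SHOW VIEW ON `{}`.* TO '{}'".format(
        escape_grant_db_name(db_name), user
    )


if __name__ == "__main__":
    print(grant_select_statement("roa_tarawiki_p", "labsdbuser"))
```

(Without the escaping, a grant on `roa_tarawiki_p` would also match any database whose name has other single characters in the underscore positions.)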
[10:06:04] (CR) Mobrovac: [C: 1] Remove mwdebug1001 from debug proxy config (being reimaged) [puppet] - https://gerrit.wikimedia.org/r/429391 (owner: Muehlenhoff)
[10:06:32] elukey: i also see "unable to connect to redis server"
[10:06:58] for a lot of redis servers
[10:07:06] (PS2) Muehlenhoff: Remove mwdebug1001 from debug proxy config (being reimaged) [puppet] - https://gerrit.wikimedia.org/r/429391
[10:07:12] (CR) Elukey: [C: 1] "Not a lot of context on this config but lgtm" [puppet] - https://gerrit.wikimedia.org/r/429391 (owner: Muehlenhoff)
[10:07:21] actually forgot one entry, updated the patch
[10:07:49] (CR) Muehlenhoff: [C: 2] Remove mwdebug1001 from debug proxy config (being reimaged) [puppet] - https://gerrit.wikimedia.org/r/429391 (owner: Muehlenhoff)
[10:07:57] mostly seemed to have affected refreshlinks jobs
[10:09:11] Volker_E: sorry, forgot to investigate yesterday, looking now
[10:09:18] mobrovac: the redis failures were on the cp side?
[10:09:50] nope elukey, on the job execution side (mw jobrunners)
[10:11:11] ah snap
[10:14:05] Volker_E: uh oh, I am looking but I am not sure what went wrong, I am not familiar with MultimediaViewer repo or qunit tests; I would recommend that you either reopen the task, or create a new one
[10:16:57] (PS1) Ema: Add n_lru_limited counter [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/429394
[10:21:45] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[10:23:46] again ^ ?
[10:23:47] wth?
[10:29:55] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[10:30:08] uh elukey, a lot of timeout errors in the eventbus proxy service "KafkaTimeoutError"
[10:30:25] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1229 is CRITICAL: Host mw1229 is not in mediawiki-installation dsh group Muehlenhoff reimage issue, will be re-done next week
[10:30:25] ACKNOWLEDGEMENT - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Muehlenhoff reimage issue, will be re-done next week
[10:30:31] (PS1) Ema: Add cache_hit_grace counter [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/429395 (https://phabricator.wikimedia.org/T192368)
[10:30:32] i can also see the 400s in the logs, but the service doesn't say why
[10:33:46] (PS2) Ema: Add cache_hit_grace counter [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/429395 (https://phabricator.wikimedia.org/T192368)
[10:39:53] !log installing uwsgi/Django security updates on graphite2001
[10:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:21] (PS2) Ladsgroup: mediawiki: Delete pre-2016 autopatrol actions from logging table of wikidata [puppet] - https://gerrit.wikimedia.org/r/428297 (https://phabricator.wikimedia.org/T189596)
[10:51:48] (CR) Jcrespo: [C: 2] mediawiki: Delete pre-2016 autopatrol actions from logging table of wikidata [puppet] - https://gerrit.wikimedia.org/r/428297 (https://phabricator.wikimedia.org/T189596) (owner: Ladsgroup)
[10:55:31] (PS1) Muehlenhoff: Revert "Remove mwdebug1001 from debug proxy config (being reimaged)" [puppet] - https://gerrit.wikimedia.org/r/429398
[10:57:13] (PS2) Muehlenhoff: Revert "Remove mwdebug1001 from debug proxy config (being reimaged)" [puppet] -
https://gerrit.wikimedia.org/r/429398
[10:57:56] (CR) Muehlenhoff: [C: 2] Revert "Remove mwdebug1001 from debug proxy config (being reimaged)" [puppet] - https://gerrit.wikimedia.org/r/429398 (owner: Muehlenhoff)
[11:03:33] (PS1) MarcoAurelio: euwikisource: add Author namespace, add English alias as well [mediawiki-config] - https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225)
[11:10:13] (CR) Jcrespo: [C: 2] mariadb: Convert db1118 into an s1 test hosts with MySQL 8.0 [puppet] - https://gerrit.wikimedia.org/r/429388 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[11:10:19] (PS3) Jcrespo: mariadb: Convert db1118 into an s1 test hosts with MySQL 8.0 [puppet] - https://gerrit.wikimedia.org/r/429388 (https://phabricator.wikimedia.org/T193226)
[11:13:07] !log installing uwsgi/Django security updates on graphite hosts in eqiad
[11:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:33] !log ladsgroup@terbium:/var/log/wikidata$ mwscript updateCollation.php --wiki=fawiki --previous-collation=xx-uca-fa
[11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:20] (PS3) Hoo man: Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513)
[11:26:59] only
[11:27:36] (CR) Hoo man: "Note: I didn't yet test my new changes… will do that later today."
[puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: Hoo man)
[11:27:45] heh
[11:34:24] (PS1) Jcrespo: mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226)
[11:34:50] (PS2) Jcrespo: mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226)
[11:34:56] (CR) jerkins-bot: [V: -1] mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[11:35:11] mobrovac: here I am sorry
[11:35:21] (CR) jerkins-bot: [V: -1] mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[11:35:27] (PS2) Muehlenhoff: Switch scap proxy for D5 to mw1251 [puppet] - https://gerrit.wikimedia.org/r/428934
[11:37:16] (CR) Muehlenhoff: [C: 2] Switch scap proxy for D5 to mw1251 [puppet] - https://gerrit.wikimedia.org/r/428934 (owner: Muehlenhoff)
[11:38:03] (PS3) Jcrespo: mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226)
[11:38:34] (CR) jerkins-bot: [V: -1] mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[11:41:22] !log reimaging mwdebug2002 to stretch
[11:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:35] (PS4) Jcrespo: mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226)
[11:42:30]
(CR) Jcrespo: [C: 2] mariadb: Use mysql-specific template for non-mariadb test host [puppet] - https://gerrit.wikimedia.org/r/429401 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[11:50:42] (PS1) Jcrespo: mariadb: Change ssl-verify-server-cert option for 8.0 compatibility [puppet] - https://gerrit.wikimedia.org/r/429402
[11:51:31] (CR) Jcrespo: [C: 2] mariadb: Change ssl-verify-server-cert option for 8.0 compatibility [puppet] - https://gerrit.wikimedia.org/r/429402 (owner: Jcrespo)
[12:19:48] (PS1) Ladsgroup: mediawiki: remove --from-id from deleteAutoPatrolLog script run [puppet] - https://gerrit.wikimedia.org/r/429407 (https://phabricator.wikimedia.org/T189596)
[12:22:31] (PS1) Jcrespo: mysql: Add auth_socket support to MySQL, equivalent to unix_socket on mariadb [puppet] - https://gerrit.wikimedia.org/r/429408 (https://phabricator.wikimedia.org/T193226)
[12:22:48] (CR) Jcrespo: [C: 2] mediawiki: remove --from-id from deleteAutoPatrolLog script run [puppet] - https://gerrit.wikimedia.org/r/429407 (https://phabricator.wikimedia.org/T189596) (owner: Ladsgroup)
[12:22:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[12:23:30] (PS2) Jcrespo: mysql: Add auth_socket support to MySQL, equivalent to unix_socket on mariadb [puppet] - https://gerrit.wikimedia.org/r/429408 (https://phabricator.wikimedia.org/T193226)
[12:26:59] (PS1) Muehlenhoff: Remove mwdebug1002 from debug proxies for stretch reimage [puppet] - https://gerrit.wikimedia.org/r/429409
[12:27:27] moritzm: sustained memcached errors, is this related to the upgrade or something else?
[12:28:12] jynus: yeah it is mwdebug2002 afaics https://logstash.wikimedia.org/app/kibana#/dashboard/memcached?_g=h@66534ad&_a=h@c822620
[12:28:14] (PS2) Muehlenhoff: Switch scap proxy in C6 to mw1320 [puppet] - https://gerrit.wikimedia.org/r/429260
[12:28:29] nutcracker complaining
[12:28:49] (CR) Muehlenhoff: [C: 2] Switch scap proxy in C6 to mw1320 [puppet] - https://gerrit.wikimedia.org/r/429260 (owner: Muehlenhoff)
[12:29:48] ok
[12:29:53] it was so high
[12:29:59] that I thought it was one of the main ones
[12:30:09] sorry for pinging
[12:30:43] (CR) Jcrespo: [C: 2] mysql: Add auth_socket support to MySQL, equivalent to unix_socket on mariadb [puppet] - https://gerrit.wikimedia.org/r/429408 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[12:30:50] (PS3) Jcrespo: mysql: Add auth_socket support to MySQL, equivalent to unix_socket on mariadb [puppet] - https://gerrit.wikimedia.org/r/429408 (https://phabricator.wikimedia.org/T193226)
[12:43:42] Operations, Analytics, EventBus: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163743 (elukey) p:Triage>High
[12:45:10] (Abandoned) Muehlenhoff: Remove unused role::mediawiki::scaler [puppet] - https://gerrit.wikimedia.org/r/428296 (owner: Muehlenhoff)
[12:46:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[12:46:29] Operations, Analytics, EventBus, Performance-Team: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163776 (elukey)
[12:49:17] Operations, Analytics, EventBus, Performance-Team: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163779 (Imarlier) @elukey yes, makes sense - I'll fix in a little bit. Sorry for the noise!
[12:49:30] Operations, Analytics, EventBus, Performance-Team: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163780 (Imarlier) a:elukey>Imarlier
[12:56:22] (PS1) Jcrespo: mariadb: Cleaup plugin load configuration [puppet] - https://gerrit.wikimedia.org/r/429411 (https://phabricator.wikimedia.org/T193226)
[12:56:51] (CR) jerkins-bot: [V: -1] mariadb: Cleaup plugin load configuration [puppet] - https://gerrit.wikimedia.org/r/429411 (https://phabricator.wikimedia.org/T193226) (owner: Jcrespo)
[12:56:53] (PS2) Muehlenhoff: Allow removing Diamond gradually [puppet] - https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454)
[12:58:40] (PS2) Jcrespo: mariadb: Cleaup plugin load configuration [puppet] - https://gerrit.wikimedia.org/r/429411 (https://phabricator.wikimedia.org/T193226)
[12:59:03] Operations, Analytics, EventBus, Performance-Team, Patch-For-Review: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163799 (elukey) Ah nice! I sent a code review as attempt to fix this, but I can abandon it if you have something ready, no problem!
[13:05:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429413 [13:08:00] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429413 (owner: 10Marostegui) [13:09:17] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429413 (owner: 10Marostegui) [13:09:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429413 (owner: 10Marostegui) [13:10:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105 after alter table (duration: 00m 59s) [13:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:15] (03CR) 10Jcrespo: [C: 032] mariadb: Cleaup plugin load configuration [puppet] - 10https://gerrit.wikimedia.org/r/429411 (https://phabricator.wikimedia.org/T193226) (owner: 10Jcrespo) [13:37:11] (03PS1) 10Filippo Giunchedi: prometheus: define recording rules for k8s alerts [puppet] - 10https://gerrit.wikimedia.org/r/429415 (https://phabricator.wikimedia.org/T193186) [13:37:13] (03PS1) 10Filippo Giunchedi: k8s: simplify prometheus alerts with recording rules [puppet] - 10https://gerrit.wikimedia.org/r/429416 (https://phabricator.wikimedia.org/T193186) [13:50:13] 10Operations, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#4163922 (10chasemp) 05Open>03Resolved Note T193196 is related for next phases here but this is racked/stack/imaged [13:51:19] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4163930 (10chasemp) [13:52:55] (03PS3) 10Filippo Giunchedi: Initial debianization [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 
(https://phabricator.wikimedia.org/T192763) [13:53:04] (03CR) 10Filippo Giunchedi: "> Patch Set 2: Code-Review+1" (036 comments) [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [13:54:28] (03PS3) 10Filippo Giunchedi: elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) [13:55:18] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [13:55:20] (03CR) 10Filippo Giunchedi: [C: 032] elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [13:58:42] gehel: FYI I didn't realize nginx on elastic machines will be restarted on merging ^, any trouble known in doing so? [13:59:03] (03PS1) 10Muehlenhoff: Move scap proxy for B7 to mw1314 [puppet] - 10https://gerrit.wikimedia.org/r/429420 [13:59:05] godog: should not be an issue [13:59:32] gehel: ok, thanks! 
[13:59:51] (03CR) 10Muehlenhoff: [C: 032] Move scap proxy for B7 to mw1314 [puppet] - 10https://gerrit.wikimedia.org/r/429420 (owner: 10Muehlenhoff) [14:07:45] (03CR) 10Filippo Giunchedi: [C: 032] Initial debianization [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [14:07:48] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Initial debianization [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [14:18:23] 10Operations, 10Patch-For-Review, 10User-Elukey: Apache reload fails on stretch-based app servers - https://phabricator.wikimedia.org/T185195#4163989 (10MoritzMuehlenhoff) I've posted a summary to the Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881725#65 We could add a workaround to our tm... [14:18:56] 10Operations, 10Patch-For-Review, 10User-Elukey: tmpreaper doesn't play along with PrivateTmp systemd units - https://phabricator.wikimedia.org/T185195#4163991 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None [14:23:41] !log Running populateRevisionLength.php on group 2 for T192189 [14:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:46] T192189: RevisionArchiveRecord incorrectly changes null ar_len to 0 - https://phabricator.wikimedia.org/T192189 [14:27:10] (03CR) 10Filippo Giunchedi: [C: 031] "I'm also ok switching mw[12]* to the default recipe" [puppet] - 10https://gerrit.wikimedia.org/r/429370 (https://phabricator.wikimedia.org/T106381) (owner: 10Dzahn) [14:27:54] (03Abandoned) 10Filippo Giunchedi: cassandra: switch to using jmx-exporter jar from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/402070 (https://phabricator.wikimedia.org/T181728) (owner: 10Filippo Giunchedi) [14:33:19] \o/ [14:33:42] ah no misread, I thought we were dropping graphite metrics D [14:33:44] :D [14:34:06] (03CR) 
10Giuseppe Lavagetto: "LGTM, see the small comment; we can still solve that issue later and globally for all prometheus exporters." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [14:34:22] (03CR) 10Filippo Giunchedi: "I'm +1 on the idea, some comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [14:34:29] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.095 second response time [14:34:47] elukey: heheh for cassandra/restbase that's already true (no graphite) [14:36:13] HI [14:36:20] I have problem with changing email [14:36:37] In gerrit settings [14:36:48] I got link for confirmation, and I confirmed it [14:37:08] But, when I want to save change, I get message: realm does not allow changing name [14:39:07] godog: ah! So let's do the same for AQS! [14:39:29] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.089 second response time [14:39:29] elukey: yeah, to pause cassandra-metrics-collector you just have to drop a file on the filesystem [14:39:37] the path of course escapes me now [14:40:32] ack, will follow up [14:46:52] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Add s1-test (db1118) to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/429425 (https://phabricator.wikimedia.org/T193226) [14:47:28] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
[14:47:31] (03CR) 10Jcrespo: [C: 032] prometheus-mysqld-exporter: Add s1-test (db1118) to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/429425 (https://phabricator.wikimedia.org/T193226) (owner: 10Jcrespo) [14:53:11] (03PS1) 10Andrew Bogott: wmcs puppetmasters: make cert_secret_path configurable per deploy [puppet] - 10https://gerrit.wikimedia.org/r/429428 (https://phabricator.wikimedia.org/T181523) [14:55:12] (03PS1) 10Elukey: role::analytics_cluster::hadoop::master: change the namenode's GC settings [puppet] - 10https://gerrit.wikimedia.org/r/429429 [14:56:10] (03PS2) 10Elukey: role::analytics_cluster::hadoop::master: change the namenode's GC settings [puppet] - 10https://gerrit.wikimedia.org/r/429429 [15:05:27] (03PS2) 10Andrew Bogott: wmcs puppetmasters: make cert_secret_path configurable per deploy [puppet] - 10https://gerrit.wikimedia.org/r/429428 (https://phabricator.wikimedia.org/T181523) [15:10:11] (03CR) 10Andrew Bogott: [C: 032] wmcs puppetmasters: make cert_secret_path configurable per deploy [puppet] - 10https://gerrit.wikimedia.org/r/429428 (https://phabricator.wikimedia.org/T181523) (owner: 10Andrew Bogott) [15:10:29] (03PS1) 10Andrew Bogott: Added new dummy files for labtestpuppetmaster puppet certs [labs/private] - 10https://gerrit.wikimedia.org/r/429430 [15:10:42] (03CR) 10Andrew Bogott: [V: 032 C: 032] Added new dummy files for labtestpuppetmaster puppet certs [labs/private] - 10https://gerrit.wikimedia.org/r/429430 (owner: 10Andrew Bogott) [15:12:18] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11054/" [puppet] - 10https://gerrit.wikimedia.org/r/429429 (owner: 10Elukey) [15:15:29] PROBLEM - dhclient process on boron is CRITICAL: Return code of 255 is out of bounds [15:15:29] PROBLEM - Disk space on boron is CRITICAL: Return code of 255 is out of bounds [15:15:58] PROBLEM - Check size of conntrack table on boron is CRITICAL: Return code of 255 is out of bounds [15:15:58] PROBLEM - Check systemd state on boron is 
CRITICAL: Return code of 255 is out of bounds [15:15:59] PROBLEM - Check whether ferm is active by checking the default input chain on boron is CRITICAL: Return code of 255 is out of bounds [15:16:08] PROBLEM - DPKG on boron is CRITICAL: Return code of 255 is out of bounds [15:16:08] PROBLEM - configured eth on boron is CRITICAL: Return code of 255 is out of bounds [15:18:38] PROBLEM - puppet last run on boron is CRITICAL: Return code of 255 is out of bounds [15:21:23] poor boron [15:21:52] somebody is hammering it with a huge build [15:22:48] (03CR) 10Alexandros Kosiaris: [C: 031] "no prometheus expert here, looks fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/429415 (https://phabricator.wikimedia.org/T193186) (owner: 10Filippo Giunchedi) [15:23:36] (03CR) 10Alexandros Kosiaris: [C: 031] "wow this looks so much cleaner and nicer!!!! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/429416 (https://phabricator.wikimedia.org/T193186) (owner: 10Filippo Giunchedi) [15:23:58] RECOVERY - Check size of conntrack table on boron is OK: OK: nf_conntrack is 0 % full [15:23:58] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational [15:24:02] nagios-nrpe-server was dead on boron :) [15:24:08] RECOVERY - Check whether ferm is active by checking the default input chain on boron is OK: OK ferm input default policy is set [15:24:08] RECOVERY - DPKG on boron is OK: All packages OK [15:24:09] RECOVERY - configured eth on boron is OK: OK - interfaces up [15:24:25] and the winner is [15:24:27] [Fri Apr 27 15:14:07 2018] Out of memory: Kill process 16128 (java) score 111 or sacrifice child [15:24:38] RECOVERY - dhclient process on boron is OK: PROCS OK: 0 processes with command name dhclient [15:24:40] RECOVERY - Disk space on boron is OK: DISK OK [15:25:01] moritzm: anything that you were building? 
---^ [15:27:06] (03PS1) 10Andrew Bogott: bootstrapvz first boot: Go back to using named 'puppet' puppetmaster name [puppet] - 10https://gerrit.wikimedia.org/r/429437 (https://phabricator.wikimedia.org/T181523) [15:28:39] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few minor pedantic comments to make the code a bit more readable. Otherwise, LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429229 (owner: 10Volans) [15:38:58] (03CR) 10Bstorm: "As is, the script tries to create the database using that exact variable. That suggests that any time the script can successfully create " [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [15:39:51] (03CR) 10Bstorm: "It wouldn't be hard to escape it, if that turns out to be needed!" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [15:44:52] (03CR) 10Bstorm: wiki replicas: add GRANT statement to $wiki_p database creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [15:46:03] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, and 2 others: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164266 (10alanajjar) [15:47:05] (03PS1) 10Ema: Introduce ttl_now and the new way of calculating TTLs in VCL [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429440 [15:50:01] (03PS2) 10ArielGlenn: generate checksums on a per job basis, updating the hash as needed [dumps] - 10https://gerrit.wikimedia.org/r/429245 [15:51:16] (03CR) 10jerkins-bot: [V: 04-1] Introduce ttl_now and the new way of calculating TTLs in VCL [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429440 (owner: 10Ema) 
[15:51:28] (03PS1) 10Urbanecm: Enable RCPatrol in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429442 (https://phabricator.wikimedia.org/T193242) [15:53:14] (03PS2) 10Ema: Introduce ttl_now and the new way of calculating TTLs in VCL [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429440 [15:57:37] (03CR) 10jerkins-bot: [V: 04-1] Introduce ttl_now and the new way of calculating TTLs in VCL [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429440 (owner: 10Ema) [16:11:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4164320 (10elukey) [16:11:38] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1980 bytes in 0.120 second response time [16:17:04] (03CR) 10Bstorm: "On my local database, the commands it produces work correctly without further manipulation." [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:19:26] (03CR) 10Jcrespo: [C: 04-2] "Please do not merge this code with a security vulnerability." 
[puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:19:29] (03CR) 10Bstorm: "I will add that the self.db_p variable that is inserted here is strictly a scrubbed DB name with an '_p' added to it--it isn't a parameter" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:19:37] !log imarlier@tin Started deploy [statsv/statsv@d5108c4]: Update statsv to force the Kafka broker API version [16:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:42] !log imarlier@tin Finished deploy [statsv/statsv@d5108c4]: Update statsv to force the Kafka broker API version (duration: 00m 05s) [16:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:55] (03CR) 10Jcrespo: [C: 04-2] "Chase and Bryan will provide context." [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:21:15] (03CR) 10Bstorm: "I will check into that. It is notable that the variable passed into that spot is not modifiable by the user (it is not passed on the comm" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:22:21] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4164325 (10Cmjohnson) @chasemp Confirmed both are 10G w/2 nics, labnet1004 can go to B2...I do not currently have any labnet server in that rack. labnet1001 is in B3 and... [16:22:53] (03PS2) 10Andrew Bogott: bootstrapvz first boot: Go back to using named 'puppet' puppetmaster name [puppet] - 10https://gerrit.wikimedia.org/r/429437 (https://phabricator.wikimedia.org/T181523) [16:23:18] (03CR) 10Jcrespo: [C: 04-2] "> I will check into that. 
It is notable that the variable passed" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:24:57] (03CR) 10Bstorm: "Oh ok! I can make it escape such characters. The variable cannot be set to anything that is not an actual db name set in the dblist file" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:25:02] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz first boot: Go back to using named 'puppet' puppetmaster name [puppet] - 10https://gerrit.wikimedia.org/r/429437 (https://phabricator.wikimedia.org/T181523) (owner: 10Andrew Bogott) [16:25:40] (03CR) 10Jcrespo: [C: 04-2] "Reproduction: Outside of the real hosts, provide grants for test_p, create a test3p database with a "secret" table, now a user has access " [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:26:27] (03CR) 10Bstorm: "I get it :) I'll make it right." 
[puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:26:45] !log imarlier@tin Started deploy [performance/navtiming@c059a60]: Deploying navtiming.py with support for enable/disable via etcd [16:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:50] !log imarlier@tin Finished deploy [performance/navtiming@c059a60]: Deploying navtiming.py with support for enable/disable via etcd (duration: 00m 05s) [16:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:03] (03PS2) 10Bstorm: wiki replicas: add GRANT statement to $wiki_p database creation [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) [16:33:05] (03CR) 10Ema: "This is the failing exception: https://github.com/wikimedia/operations-debs-varnish4/blob/debian-wmf/bin/varnishtest/tests/c00041.vtc#L85" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/429440 (owner: 10Ema) [16:33:36] (03CR) 10Bstorm: "This will work as long as there are no other legal characters in DB names that will cause concern in a grant." [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:34:39] (03PS1) 10Chad: Apache redirects: keep query string attached [puppet] - 10https://gerrit.wikimedia.org/r/429447 [16:35:44] (03PS2) 10Chad: Apache redirects: keep query string attached [puppet] - 10https://gerrit.wikimedia.org/r/429447 [16:36:09] (03CR) 10Jcrespo: "Could this be tested somehow?" 
[puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:37:39] (03CR) 10Bstorm: "I ran it locally (with a variable changed so it uses the username "floop" and using the wiki "floop_enwiki") with debug on to verify that " [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:40:21] (03PS4) 10Hoo man: Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) [16:40:55] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4164353 (10herron) mx2001 has been running Stretch for a few days and has been stable. I think we're in good shape to move on to mx1001. However, there are a few configs with mx1001 hardco... [16:41:41] (03CR) 10Hoo man: "Tested the last version now and actually found a bug (which I fixed)" [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [16:44:48] (03PS1) 10Chad: Greatly simplify svn.wikimedia.org redirects [puppet] - 10https://gerrit.wikimedia.org/r/429449 [16:44:51] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4164358 (10elukey) 05Open>03Resolved Changes deployed by @Imarlier, everything looks good now! Thanks! [16:46:39] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.101 second response time [16:47:07] (03CR) 10Hoo man: "Tested with testwikidata: Dump creation, recovery on failure, giving up after 5 failures in a row!" 
[puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [16:53:19] (03CR) 10Jcrespo: "So we have this script check_private_data.py that checks periodically no bad data is available on labs. Not part of the scope of this comm" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:56:03] (03CR) 10Bstorm: "Definitely won't deploy today. I am working on a functional test locally that I can use to confirm consistency with grants that are on th" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [16:57:35] (03PS1) 10Chad: Apache redirects: rewrite all WMF URLs to https [puppet] - 10https://gerrit.wikimedia.org/r/429452 [17:01:21] (03CR) 10Jcrespo: "This is the file that runs every week and sends us an alerts by email if something is wrong: https://phabricator.wikimedia.org/source/oper" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [17:03:09] (03PS3) 10Elukey: role::analytics_cluster::hadoop::master: change the namenode's GC settings [puppet] - 10https://gerrit.wikimedia.org/r/429429 (https://phabricator.wikimedia.org/T193257) [17:04:56] (03CR) 10Muehlenhoff: Allow removing Diamond gradually (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:09:14] (03PS3) 10Dzahn: install_server: let all mw2* hosts use raid1-lvm partman [puppet] - 10https://gerrit.wikimedia.org/r/429370 (https://phabricator.wikimedia.org/T106381) [17:10:59] (03CR) 10Dzahn: [C: 032] "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/429370 (https://phabricator.wikimedia.org/T106381) (owner: 10Dzahn) [17:14:17] (03CR) 10Dzahn: [C: 04-1] "please also add the ".m." 
entry for mobile" [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [17:15:02] (03CR) 10Dzahn: [C: 04-1] "see further down in the file following the "; Mobile" line" [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [17:15:38] (03CR) 10Bstorm: "For the local functional test:" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [17:19:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4164448 (10Jgreen) 05Open>03Resolved Casey's got this host up and running, closing task! [17:30:51] (03PS1) 10Herron: standard::mail::sender: run a smtp daemon on localhost:25 [puppet] - 10https://gerrit.wikimedia.org/r/429456 (https://phabricator.wikimedia.org/T175361) [17:34:48] (03PS3) 10Herron: WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) [17:35:43] (03PS4) 10Herron: WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) [17:43:34] (03CR) 10Marostegui: "> Definitely won't deploy today. 
I am working on a functional test" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [17:44:11] (03PS1) 10Herron: profile::kafka::burrow: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429457 (https://phabricator.wikimedia.org/T175361) [17:45:09] (03PS2) 10Herron: profile::kafka::burrow: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429457 (https://phabricator.wikimedia.org/T175361) [17:45:14] (03PS1) 10Dzahn: admins: add Tobias Schumann to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/429458 (https://phabricator.wikimedia.org/T192549) [17:46:08] (03CR) 10Dzahn: [C: 032] admins: add Tobias Schumann to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/429458 (https://phabricator.wikimedia.org/T192549) (owner: 10Dzahn) [17:48:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1946 bytes in 0.080 second response time [17:59:00] (03PS1) 10Dzahn: admins: add bitpogo and tieu to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/429460 (https://phabricator.wikimedia.org/T191523) [17:59:02] (03PS1) 10Andrew Bogott: labtest puppetmaster: change httpyaml url to resemble the main one [puppet] - 10https://gerrit.wikimedia.org/r/429461 [18:00:30] (03CR) 10Andrew Bogott: [C: 032] labtest puppetmaster: change httpyaml url to resemble the main one [puppet] - 10https://gerrit.wikimedia.org/r/429461 (owner: 10Andrew Bogott) [18:15:10] (03PS1) 10Chad: Disable LQT on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429463 [18:16:53] !log mw2167,mw2168,mw2169 - reinstalling with stretch and raid1-lvm [18:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:07] (03CR) 10Ottomata: "Some nits, but almost +1 from me! 
:)" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [18:29:56] https://usercontent.irccloud-cdn.com/file/7udjbRBJ/Screen%20Shot%202018-04-27%20at%208.29.37%20PM.png [18:30:01] ottomata: Puppet? [18:30:20] 10Operations, 10Cloud-VPS, 10Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392#4164534 (10chasemp) a:03aborrero [18:34:53] (03CR) 10Imarlier: [C: 04-1] "No longer needed." [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [18:43:31] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 0 hours old. [18:51:42] (03PS5) 10MarcoAurelio: idwikimedia: register on DNS [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) [18:52:21] (03CR) 10MarcoAurelio: "> please also add the ".m." entry for mobile" [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [18:52:32] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#4164607 (10chasemp) [18:53:36] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#4164611 (10chasemp) 05Open>03Resolved in favor of T193264 [18:53:51] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.109 second response time [18:55:28] (03PS1) 10Mforns: Add cron job to sanitize EventLogging data in Hive [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) [18:55:56] (03CR) 10jerkins-bot: [V: 04-1] Add cron job to sanitize EventLogging data in Hive [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [18:56:52] (03CR) 10Mforns: [C: 
04-1] "Please, do not merge this until" [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [18:57:40] (03CR) 10Ottomata: "You also need to change the usage of the data_drop class :)" [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [18:58:44] (03CR) 10Mforns: [C: 04-1] "> You also need to change the usage of the data_drop class :)" [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [18:59:16] (03CR) 10Dzahn: [C: 032] idwikimedia: register on DNS [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [18:59:48] (03PS2) 10Mforns: Add cron job to sanitize EventLogging data in Hive [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) [19:00:22] (03CR) 10Mforns: [C: 04-1] "Please, do not merge this until" [puppet] - 10https://gerrit.wikimedia.org/r/429465 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [19:00:44] (03PS1) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:01:14] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [19:02:44] (03CR) 1020after4: "Why switch to pull?" 
[puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn) [19:03:03] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164661 (10MarcoAurelio) [19:03:46] (03PS2) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:03:51] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164229 (10MarcoAurelio) Strangely all of them got stuck at Meta-Wiki. Someone please in logsta... [19:04:17] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [19:05:25] (03PS3) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:13:59] (03CR) 10ArielGlenn: "Because we want to manage all dataset rsyncs the same way, from the web server, as part of the migration to the labstore1006,7 hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn) [19:15:10] (03CR) 10Jforrester: [C: 031] Disable LQT on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429463 (owner: 10Chad) [19:17:36] (03CR) 10BryanDavis: wiki replicas: add GRANT statement to $wiki_p database creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [19:27:07] (03PS4) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:29:59] (03PS5) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:32:31] (03CR) 10Bstorm: "I tested the built-in escaping setup, and I found that it does not modify it to what is requested here. The issue is that we want \_ in t" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [19:33:37] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164722 (10Tgr) There are thousands of `Wikimedia\Rdbms\LoadBalancer::{closure}: found writes p... 
[19:34:04] (03CR) 10Bstorm: "I can always play with the templating just to be sure, but I think this is the only way to make sure exactly that string ends up in the gr" [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: 10Bstorm) [19:36:28] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427182 (https://phabricator.wikimedia.org/T192386) (owner: 10Subramanya Sastry) [19:37:23] (03Abandoned) 10Ori.livneh: coal: add a simple systemd watchdog notifier; set WatchdogSec=60 [puppet] - 10https://gerrit.wikimedia.org/r/421981 (owner: 10Ori.livneh) [19:41:40] (03PS6) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:42:14] (03CR) 10jerkins-bot: [V: 04-1] openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [19:45:04] (03PS7) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:45:22] tgr: thanks for checking logstash, can those be unblocked then? [19:45:28] manually I mean [19:45:47] too bad the script breaks the date and time of the unification [19:48:22] (03PS8) 10Andrew Bogott: openstack: refactor 'cloudrepo' setup [puppet] - 10https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) [19:49:09] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-Site-requests, 10Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164756 (10Tgr) None of those renames started on meta, I'll just run one by hand and see how th... 
[19:51:19] Aliya one went okay
[19:52:16] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164760 (Tgr) Seems to have worked fine, let's clean them all up.
[19:58:18] !log T193254 ran fixStuckGlobalRename.php for: Aliya klein Hasselb Husseinzadeh02 Jswf845 Lorraine Fgr Mikeypugs0134 Ncanty STEEEPGlobal Sunlight me THOR Global Defense Group TPBox Zenas Gao אֲבִי גְדוֹר ぽっぽ大将軍
[19:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:22] T193254: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254
[19:58:46] Hauskatze: no clue what happened here, seems like the jobs didn't even start
[19:59:00] but then I'm not sure how reliable the job queue logs are
[20:00:02] the date handling is annoying but not trivial to fix
[20:00:09] tgr: on beta the job queue is broken so we have no global rename
[20:00:22] some kafka sh** or something like that
[20:00:56] why would anyone want to be globally renamed on beta?
[20:01:17] (CR) Andrew Bogott: [C: 2] openstack: refactor 'cloudrepo' setup [puppet] - https://gerrit.wikimedia.org/r/429466 (https://phabricator.wikimedia.org/T192162) (owner: Andrew Bogott)
[20:01:23] well I had a couple of instances that I had to fix with the script
[20:01:37] besides, it is annoying and prevents testing possible enhancements on the feature
[20:04:49] tgr: please watch https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/H%C3%BCseynzad%C9%99 -- queued on minwiki for some time already, maybe it's not working there too?
[20:10:05] (CR) Jcrespo: "Welcome to my world :-)" [puppet] - https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490) (owner: Bstorm)
[20:11:56] should be switched back to use the old jobqueue if it is permanently broken, I suppose
[20:18:09] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164842 (alanajjar) Is it finished? I see [[https://meta.wikimedia.org/wiki/Special:GlobalRen...
[20:19:35] Operations, cloud-services-team, monitoring: Prometheus vs. CPU usage vs. hyperthreading - https://phabricator.wikimedia.org/T193272#4164846 (Andrew)
[20:22:58] (PS2) Andrew Bogott: admin: Allow wmcs-roots access to role::labs::monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/423960 (https://phabricator.wikimedia.org/T162404) (owner: BryanDavis)
[20:23:50] (CR) Andrew Bogott: [C: 2] admin: Allow wmcs-roots access to role::labs::monitoring hosts [puppet] - https://gerrit.wikimedia.org/r/423960 (https://phabricator.wikimedia.org/T162404) (owner: BryanDavis)
[20:24:55] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164860 (MarcoAurelio) https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/H%C3%BCse...
[20:27:14] (PS2) Catrope: Enable mapframe on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428851 (https://phabricator.wikimedia.org/T191584)
[20:27:22] (CR) Krinkle: Apache redirects: rewrite all WMF URLs to https (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429452 (owner: Chad)
[20:31:18] (CR) Krinkle: Greatly simplify svn.wikimedia.org redirects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429449 (owner: Chad)
[20:34:51] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164869 (alanajjar) >>! In T193254#4164860, @MarcoAurelio wrote: > https://meta.wikimedia.org...
[20:34:55] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests, Wikimedia-log-errors: Please unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T193254#4164870 (alanajjar) Open>Resolved a:Tgr
[20:52:21] Operations, ops-codfw, DBA, hardware-requests, Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4164880 (RobH)
[20:53:32] (PS1) RobH: decom db2011 prod dns entries [dns] - https://gerrit.wikimedia.org/r/429521 (https://phabricator.wikimedia.org/T187886)
[20:54:00] (CR) RobH: [C: 2] decom db2011 prod dns entries [dns] - https://gerrit.wikimedia.org/r/429521 (https://phabricator.wikimedia.org/T187886) (owner: RobH)
[20:55:49] (PS1) RobH: decom db2011 [puppet] - https://gerrit.wikimedia.org/r/429522 (https://phabricator.wikimedia.org/T187886)
[20:56:09] (CR) RobH: [C: 2] decom db2011 [puppet] - https://gerrit.wikimedia.org/r/429522 (https://phabricator.wikimedia.org/T187886) (owner: RobH)
[20:57:16] Operations, ops-codfw, DBA, hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4164890 (RobH)
[20:57:37] Operations, ops-codfw, DBA, hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (RobH) a:RobH>Papaul ready for onsite completion of steps
[21:00:38] (CR) Krinkle: Make webperf role install coal things (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: Imarlier)
[21:03:28] (CR) Krinkle: Make webperf role install coal things (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: Imarlier)
[21:06:01] (CR) Krinkle: Make webperf role install coal things (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: Imarlier)
[21:37:31] ebernhardson: do you still need the screen on elastic1020?
[21:48:22] mutante: probably not, checking
[21:48:44] closed
[21:48:58] ebernhardson: thanks :)
[22:01:12] RECOVERY - Long running screen/tmux on elastic1020 is OK: OK: No SCREEN or tmux processes detected.
[22:01:37] ACKNOWLEDGEMENT - DPKG on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall resumed
[22:01:37] ACKNOWLEDGEMENT - dhclient process on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall resumed
[22:01:37] ACKNOWLEDGEMENT - Disk space on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall resumed
[22:01:37] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2168 is CRITICAL: Host mw2168 is not in mediawiki-installation dsh group daniel_zahn reinstall resumed
[22:01:37] ACKNOWLEDGEMENT - HHVM processes on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall resumed
[22:01:37] ACKNOWLEDGEMENT - nutcracker port on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
daniel_zahn reinstall resumed
[22:01:38] ACKNOWLEDGEMENT - nutcracker process on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall resumed
[22:02:42] PROBLEM - HHVM rendering on mw2170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:04:13] PROBLEM - HHVM rendering on mw2169 is CRITICAL: connect to address 10.192.32.57 and port 80: Connection refused
[22:04:13] PROBLEM - HHVM processes on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:04:13] PROBLEM - nutcracker process on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:04:13] PROBLEM - nutcracker port on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:05:35] (CR) Chad: Apache redirects: rewrite all WMF URLs to https (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429452 (owner: Chad)
[22:07:18] !log Running quibble-vendor-mysql-php70-docker against ~ 900 MediaWiki extensions. Triggered with a custom gear-client.py script from contint1001. PID 29710
[22:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:31] (one by one, so it is not going to be an issue)
[22:15:43] RECOVERY - HHVM rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 81072 bytes in 3.686 second response time
[22:18:23] RECOVERY - HHVM processes on mw2168 is OK: PROCS OK: 6 processes with command name hhvm
[22:20:19] (PS7) Dzahn: icinga: add notification type to SMS content and other improvements [puppet] - https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862)
[22:22:56] (CR) Dzahn: [C: 1] "rebased, we have tested the "new" version of the commands for a while (Rob and Daniel have already been using them).
now making "new" the " [puppet] - https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862) (owner: Dzahn)
[22:23:12] (PS8) Dzahn: icinga: add notification type to SMS content and other improvements [puppet] - https://gerrit.wikimedia.org/r/406535 (https://phabricator.wikimedia.org/T185862)
[22:24:38] RECOVERY - nutcracker port on mw2168 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[22:25:37] RECOVERY - nutcracker process on mw2169 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[22:25:38] RECOVERY - HHVM rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 81072 bytes in 7.072 second response time
[22:36:38] (CR) Krinkle: Apache redirects: rewrite all WMF URLs to https (1 comment) [puppet] - https://gerrit.wikimedia.org/r/429452 (owner: Chad)
[22:38:39] (CR) Jforrester: [C: 1] multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - https://gerrit.wikimedia.org/r/428865 (owner: Chad)
[22:44:20] (CR) Krinkle: Remove wikipedia.org vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398396 (owner: EddieGP)
[22:45:22] !log m2171,mw2172,mw2173 ff.
- reinstalling with stretch and raid1-LVM
[22:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:26] (CR) Krinkle: [C: 1] multiversion: Don't use getRealmSpecificFilename where it's not needed [mediawiki-config] - https://gerrit.wikimedia.org/r/428865 (owner: Chad)
[23:03:56] (PS5) EddieGP: Remove wikipedia.org vhost [puppet] - https://gerrit.wikimedia.org/r/398396
[23:04:39] (CR) EddieGP: Remove wikipedia.org vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/398396 (owner: EddieGP)
[23:07:38] (CR) Dzahn: "oops, doesn't seem to work:" [puppet] - https://gerrit.wikimedia.org/r/429114 (owner: Dzahn)
[23:07:44] (CR) Dzahn: [C: -1] base: update version of gen_fingerprints script [puppet] - https://gerrit.wikimedia.org/r/429114 (owner: Dzahn)
[23:44:52] Operations, Patch-For-Review: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#4165623 (Dzahn) several have been fixed. updated output, now using cumin instead of salt: eqiad: mw1221.eqiad.wmnet: no sdb in mdstat or no raid mw1222.eqiad.wmnet: no sdb in mdstat or...