[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T0000). [00:02:27] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), and 4 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Smalyshev) [00:14:39] Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (faidon) >>! In T207536#4689900, @GTirloni wrote: > @faidon the complete separation seems like a great goal from a se... [00:27:48] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): wdqs updater should be better isolated from blazegraph and common workload should be shared between servers - https://phabricator.wikimedia.org/T207837 (Smalyshev) Huh this is a big one. I've thought about it a bunch l... [00:30:02] Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (faidon) >>! In T207536#4692241, @aborrero wrote: > Please @faidon confirm I'm understanding this right. > > If I co... [00:35:39] Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair) >>! In T207536#4693519, @faidon wrote: > I don't think the intention was to put any pressure about doing th... [00:37:35] Operations, Cloud-VPS, cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (Krenair) >>! In T207536#4693535, @faidon wrote: >>>!
In T207536#4692241, @aborrero wrote: >> Please @faidon confirm... [01:01:13] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [01:07:54] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [01:33:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.224 second response time [01:34:04] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [01:36:24] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:43:05] (03CR) 10GTirloni: [C: 031] Move mail_smarthost (and wikimail_smarthost) to hiera [puppet] - 10https://gerrit.wikimedia.org/r/469524 (https://phabricator.wikimedia.org/T207887) (owner: 10Alex Monk) [03:30:54] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 966.00 seconds [03:32:01] 10Operations, 10Discovery-Search (Current work): Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (10Mathew.onipe) p:05Triage>03Normal [03:35:34] 10Operations, 10Discovery-Search (Current work): Write cookbooks to support spicerack's elasticsearch multi cluster/instance - https://phabricator.wikimedia.org/T207919 (10Mathew.onipe) p:05Triage>03Normal [03:38:13] 10Operations, 10Discovery-Search (Current work): Test spicerack elasticsearch module on relforge or similar environment - https://phabricator.wikimedia.org/T207920 (10Mathew.onipe) 
p:05Triage>03Normal [03:43:42] 10Operations, 10Discovery-Search, 10Elasticsearch: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (10Mathew.onipe) [03:47:45] (03PS4) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [03:50:10] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [03:50:51] (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [03:53:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 152.66 seconds [03:58:33] RECOVERY - High lag on wdqs1004 is OK: (C)3600 ge (W)1200 ge 1177 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:04:51] 10Operations, 10Security-Team, 10Wikimedia-Site-requests, 10Patch-For-Review: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Bawolff) [Just for context, i did small wikis, but I'll wait until talking to logstash folks before doing big wikis] [04:07:11] (03PS5) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [04:10:22] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [04:15:43] RECOVERY - High lag on wdqs1005 is OK: (C)3600 ge 
(W)1200 ge 1165 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:15:55] (03PS6) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [05:38:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:38:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:45] !log upload druid 0.12.3-1 debs to stretch-wikimedia [06:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:46] !log depooling wdqs1003 again, it's not catching up like the other hosts [06:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:04] (03PS1) 10Elukey: profile::eventlogging::analytics:files: do not delaycompress logs [puppet] - 10https://gerrit.wikimedia.org/r/469556 [06:24:07] (03CR) 10Elukey: [C: 032] profile::eventlogging::analytics:files: do not delaycompress logs [puppet] - 10https://gerrit.wikimedia.org/r/469556 (owner: 10Elukey) [06:28:44] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml] [06:29:13] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:29:33] PROBLEM - puppet last run on phab1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/diamond/collectors/ApacheStatusSimple/ApacheStatusSimple.py] [06:30:13] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 74181 bytes in 1.680 second response time [06:33:18] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:33:18] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/bin/swift-drive-audit] [06:58:13] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:25] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:23] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:13] RECOVERY - puppet last run on phab1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:11:17] !log installing requests security updates on trusty [07:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:28] !log Uploaded certcentral 0.3 to apt.wikimedia.org (stretch) - T207737 T207478 [07:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:34] T207478: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478 [07:16:34] T207737: LE rejects issuing two certificates with the same CSR on a short timespan - https://phabricator.wikimedia.org/T207737 [07:29:53] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.118 second response time [07:33:14] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:38:43] 
(03PS1) 10Elukey: hive: introduce HIVE_SERVER2_HADOOP_OPTS [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469562 (https://phabricator.wikimedia.org/T184794) [07:43:42] (03PS2) 10Muehlenhoff: Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 [07:44:43] (03CR) 10jerkins-bot: [V: 04-1] Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 (owner: 10Muehlenhoff) [07:46:07] (03PS3) 10Muehlenhoff: Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 [07:51:41] (03PS1) 10Elukey: hive: add ensure => 'directory' to /tmp/hive-parquet-logs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469563 [07:52:07] (03CR) 10Elukey: [V: 032 C: 032] hive: add ensure => 'directory' to /tmp/hive-parquet-logs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469563 (owner: 10Elukey) [07:53:31] (03PS1) 10Elukey: Update cdh submodule [puppet] - 10https://gerrit.wikimedia.org/r/469564 [07:54:52] (03CR) 10Elukey: [C: 032] Update cdh submodule [puppet] - 10https://gerrit.wikimedia.org/r/469564 (owner: 10Elukey) [07:57:59] (03PS4) 10Muehlenhoff: Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 [07:58:23] (03PS2) 10Elukey: hive: introduce HIVE_SERVER2_HADOOP_OPTS [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469562 (https://phabricator.wikimedia.org/T184794) [08:04:58] (03CR) 10Elukey: [C: 032] hive: introduce HIVE_SERVER2_HADOOP_OPTS [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469562 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [08:05:14] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10MoritzMuehlenhoff) >>! In T207792#4691594, @jijiki wrote: > @Addshore could you please give us some context on this (e.g. they are not working for WMDE anymore)? thank you! (As he's still li... 
[08:05:53] (03PS1) 10Elukey: Update cdh submodule [puppet] - 10https://gerrit.wikimedia.org/r/469567 [08:07:28] (03CR) 10DCausse: "left a small suggestion" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [08:11:33] (03CR) 10Elukey: [C: 032] Update cdh submodule [puppet] - 10https://gerrit.wikimedia.org/r/469567 (owner: 10Elukey) [08:16:34] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:20:33] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:20:57] (03CR) 10Gehel: wdqs: increase restart interval of wdqs-updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469447 (https://phabricator.wikimedia.org/T207843) (owner: 10Gehel) [08:21:42] it seemed one single spike [08:21:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:22:53] RECOVERY - HTTP 
availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:23:03] ema: --^ [08:26:53] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:53] probably reboots? [08:26:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:28:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:29:34] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:31:50] some failed fetches from https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now [08:33:42] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) So this shows that we have less than 0.04% of packet loss on the elasticsearch eqiad cluster? I would expect a loss rate... 
[08:36:29] 10Operations, 10MediaWiki-extensions-Translate: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Trizek-WMF) [08:37:21] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10MoritzMuehlenhoff) >>! In T207775#4691484, @fgiunchedi wrote: >>>! In T207775#4691005, @fgiunchedi wrote: >> We enabl... [08:39:08] 10Operations, 10MediaWiki-extensions-Translate: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Mainframe98) Same error as with {T207928}. Trainblocker? [08:44:51] (03PS1) 10Elukey: hive: replace HADOOP_OPTS with HIVE_SERVER2_HADOOP_OPTS for hive-server2 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469584 (https://phabricator.wikimedia.org/T184794) [08:45:54] (03CR) 10Elukey: [V: 032 C: 032] hive: replace HADOOP_OPTS with HIVE_SERVER2_HADOOP_OPTS for hive-server2 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469584 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [08:46:39] 10Operations, 10MediaWiki-extensions-Translate: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Mainframe98) Presumably caused by {rETRAb2586aebd94d805b82a018459b3197916a3b1992}. Cc'ing @cscott as author of the patch. [08:49:55] (03PS3) 10Filippo Giunchedi: [deployment-prep] fix elastic config for deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/469387 (https://phabricator.wikimedia.org/T205672) (owner: 10DCausse) [08:50:11] 10Operations, 10MediaWiki-extensions-Translate: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Trizek-WMF) >>! In T207930#4693990, @Mainframe98 wrote: > Same error as with {T207928}. Looks like it. I don't have that issue on Meta. 
[08:50:50] (03PS1) 10Elukey: role::analytics_cluster_coordinator: enable prometheus metrics for hive [puppet] - 10https://gerrit.wikimedia.org/r/469585 (https://phabricator.wikimedia.org/T184794) [08:50:53] (03CR) 10Filippo Giunchedi: [C: 032] [deployment-prep] fix elastic config for deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/469387 (https://phabricator.wikimedia.org/T205672) (owner: 10DCausse) [08:52:22] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13197/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/469585 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [08:52:33] (03PS2) 10Elukey: role::analytics_cluster_coordinator: enable prometheus metrics for hive [puppet] - 10https://gerrit.wikimedia.org/r/469585 (https://phabricator.wikimedia.org/T184794) [08:53:37] (03CR) 10Elukey: [C: 032] role::analytics_cluster_coordinator: enable prometheus metrics for hive [puppet] - 10https://gerrit.wikimedia.org/r/469585 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [08:57:22] elukey, moritzm: hey [08:57:32] nope, I haven't started with the reboots yet this morning [08:58:46] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10fgiunchedi) Patch merged, though ferm fails because of a known... [09:00:48] ah, ok [09:02:40] I see no specific issues on the codfw backends, so perhaps a ulsfo<->codfw network blip? [09:06:24] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:06:30] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) [09:07:35] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) [09:08:33] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Wikimedia-production-error: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) [09:08:42] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Wikimedia-production-error: Move a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) [09:08:50] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [09:09:59] at any rate, it seems that was just a temporary glitch, safe to resume the reboots [09:10:09] !log resume cache hosts rolling reboots for kernel/microcode updates T203011 [09:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:15] !log elukey@deploy1001 Started deploy [analytics/turnilo/deploy@84bf1ad]: Upgrade to 1.8.1 [09:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:25] !log elukey@deploy1001 Finished deploy [analytics/turnilo/deploy@84bf1ad]: Upgrade to 1.8.1 (duration: 00m 10s) [09:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:10] dcausse: looks like elasticsearch on deployment-logstash2 is back up! 
[09:29:27] godog: \o/ [09:29:30] for some reason though new logstash indices are not being created [09:29:39] :/ [09:29:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [09:30:08] godog: might be a data.path mixup, I'll take a look [09:30:09] I have to go shortly, will take a look too later, in case someone wants to look now [09:30:13] thanks dcausse! [09:30:15] sure [09:31:51] dcausse: I've manually turned on --debug in the logstash systemd unit and manually fixed the ferm rules due to https://phabricator.wikimedia.org/T205672#4694026 but no changes other than that besides what puppet did [09:32:11] ok [09:32:13] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:34:04] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [09:34:43] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:34:55] (PS1) Vgutierrez: certcentral: Implement slow retries on challenge rejection by ACME dir.
[software/certcentral] - 10https://gerrit.wikimedia.org/r/469590 (https://phabricator.wikimedia.org/T207927) [09:35:34] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:14] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) >>! In T207536#4693554, @Krenair wrote: >>>! In T207536#4693535, @faidon wrote: >> >> I'm a little confus... [09:37:37] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Wikimedia-production-error: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10MGChecker) [09:41:27] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Wikimedia-production-error: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10MGChecker) Reported at [[ https://www.mediawiki.org/wiki/Topi... [09:43:04] godog: I see that it still receives very old events [09:43:08] output received {"event"=>{"severity"=>6, "level"=>"INFO", "timestamp8601"=>"2018-10-22T19:01:31.456878+00:00" [09:43:20] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add Lars Wirzenius to releng LDAP groups - https://phabricator.wikimedia.org/T207833 (10LarsWirzenius) @hashar @jijiki Thanks! I confirm that I can see logstash and grafana now. [09:44:03] I have no clue how it can remember such old events, are they queued now (kafka or something else)? 
[09:46:15] oh yes I see "closing connection org.apache.kafka.common.network.Selector", it must be catching up its backlog, new indices should come up at some point I suppose [09:49:57] (03PS1) 10Muehlenhoff: Remove Pybal Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/469593 (https://phabricator.wikimedia.org/T183454) [09:49:59] (03PS1) 10Muehlenhoff: Remove Diamond from LVSes [puppet] - 10https://gerrit.wikimedia.org/r/469594 (https://phabricator.wikimedia.org/T183454) [09:51:43] !log resetting deployment directory on wdqs1003 [09:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:41] !log upgrade druid100[1-3] to druid 0.12.3 [10:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:38] (03CR) 10Alex Monk: [C: 032] certcentral: Implement slow retries on challenge rejection by ACME dir. [software/certcentral] - 10https://gerrit.wikimedia.org/r/469590 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [10:25:43] (03Merged) 10jenkins-bot: certcentral: Implement slow retries on challenge rejection by ACME dir. [software/certcentral] - 10https://gerrit.wikimedia.org/r/469590 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [10:27:33] (03CR) 10jenkins-bot: certcentral: Implement slow retries on challenge rejection by ACME dir. 
[software/certcentral] - 10https://gerrit.wikimedia.org/r/469590 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [10:30:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.326 second response time [10:33:24] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [10:39:08] mmmm the mw exceptions graph looks horrible since 7:30 AM [10:39:10] is it known? [10:39:22] -/sleep [10:39:26] er :O [10:39:41] Sorry, I have a command to do /away everywhere :P [10:39:47] Hmm, night. [10:47:19] Hallo. [10:47:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [10:47:34] I'm here for the SWAT that's happening soon. [10:47:57] next [10:48:07] logmsgbot: next [10:48:17] mmm this morning is difficult [10:49:05] jouncebot: next [10:49:05] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1100) [10:49:10] oh there you go [10:49:22] :) [10:50:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.482 second response time [10:53:53] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:56:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10MoritzMuehlenhoff) 05Resolved>03Open @kostajh : You're using the same key in production as in WMCS: This is a security risk since... 
[10:57:56] !log restart pdfrender on scb1003 [10:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:13] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1100). [11:00:04] bmansurov, Zoranzoki21, and aharoni: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] I can swat today [11:00:25] o/ zeljkof [11:01:07] bmansurov: if you are a deployer, feel free to deploy your patch [11:01:13] otherwise, I can do it [11:01:19] zeljkof: I'm not a deployer ;( [11:01:40] bmansurov: that's not a closed club, you can always become one ;) [11:01:55] (PS3) Zfilipin: Stop collecting data CitaitonUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [11:02:01] zeljkof: thanks, I'll keep it in mind. [11:02:16] bmansurov: I'll ping you in a few minutes when the patch is at mwdebug1002 and ready for testing [11:02:23] zeljkof: cool [11:03:15] bmansurov: just checking, a gerrit comment says "Deploy on 10/29" [11:03:18] did the timeline change?
[11:03:23] zeljkof: yes [11:03:37] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [11:03:41] ok, merging [11:03:52] OK [11:05:13] (Merged) jenkins-bot: Stop collecting data CitaitonUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [11:07:04] bmansurov: it's at mwdebug1002, please test and let me know if I can deploy it [11:07:10] ok [11:07:32] zeljkof: it's working, please go on [11:07:37] (CR) jenkins-bot: Stop collecting data CitaitonUsage and CitationUsagePageLoad [mediawiki-config] - https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) (owner: Bmansurov) [11:07:47] bmansurov: ok, deploying [11:08:11] (PS7) GTirloni: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:08:58] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465418|Stop collecting data CitaitonUsage and CitationUsagePageLoad (T191086 T203253)]] (duration: 00m 57s) [11:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:03] T191086: Instrument and collect data via CitationUsage schema - https://phabricator.wikimedia.org/T191086 [11:09:03] T203253: Run a Second Round of Data Collection - https://phabricator.wikimedia.org/T203253 [11:09:15] bmansurov: it's deployed, please test and thanks for deploying with #releng ;) [11:09:57] hi zeljkof o/ [11:10:12] zeljkof: looks great. Thank you and great customer service you got at #releng :))) [11:12:07] bmansurov: we are here to server, until software replaces us ;) [11:12:17] "here to serve" [11:12:27] hi aharoni!
[11:12:29] (CR) GTirloni: [C: 2] ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:12:53] I have two patches to deploy. Both are needed for meaningful testing. [11:12:53] (PS10) GTirloni: hiera: diamond::remove on openstack control role [puppet] - https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite) [11:13:20] aharoni: ok, I'll ping you in a few minutes, just to deploy one simple commit [11:13:30] OK [11:13:34] ;) [11:14:33] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) (owner: Zoranzoki21) [11:15:35] (Merged) jenkins-bot: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) (owner: Zoranzoki21) [11:16:49] Hi, I had problems with access [11:17:00] Has SWAT not ended yet? [11:17:00] Zoranzoki21: just deploying your commit :) [11:17:26] it looked simple enough, and there's nothing to test anyway... [11:17:28] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:469261|New throttle rule for Johannesburg Event on 2018-10-27 (T207742)]] (duration: 00m 55s) [11:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:31] T207742: Requesting temporary lift of IP cap for Johannesburg Event on 2018-10-27 - https://phabricator.wikimedia.org/T207742 [11:17:34] zeljkof: Yes [11:17:37] Zoranzoki21: it's deployed! :) [11:17:53] aharoni: please stand by, you're next! :) [11:18:06] ack [11:18:13] I'll ping you as soon as the first commit is at mwdebug1002 for testing [11:19:18] aharoni: so, um, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460895 is for master?
[11:19:38] zeljkof: what do you mean exactly? [11:19:38] and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/469507 also? [11:19:44] ok, so let me explain [11:20:08] It's merged, and I need both deployed to production Wikipedias in all languages. [11:20:08] if a commit gets merged into master, like the two above are, they will be deployed during the next train deploy [11:20:13] 10Operations, 10SRE-Access-Requests: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (10jijiki) p:05Triage>03Normal [11:20:23] so probably next week [11:20:43] depending on if they merged before the new deplyoment branch was cut [11:20:55] That's the problem: one before, one after. [11:21:06] if a commit needs to be deployed before the next train, we can deploy it during swat [11:21:07] and I need them together, in production today if possible. [11:21:18] now is SWAT, isn't it? [11:21:22] yes [11:21:35] but we do not deploy master to production [11:21:46] we deploy a deployment branch [11:22:12] so, a commit has to be cherry picked to a branch, merged there and deployed [11:22:24] Oh, I thought this is no longer needed. [11:22:30] Can I do it quickly? [11:22:40] to which branch? [11:22:42] since deployment situation is not good this week, I think new branch is only on group 0 [11:23:00] http://tools.wmflabs.org/versions/ [11:23:03] sure, it should be doable in a swat window, it's mostly waiting for CI [11:23:15] but 10-20 minutes, depending on jobs that run [11:23:33] (03CR) 10jenkins-bot: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) (owner: 10Zoranzoki21) [11:23:52] current branches are 1.32.0-wmf.26 (old, but around possibly for a few more days, or even until next week) [11:24:04] remind me please, what are groups 0, 1, 2? 
[11:24:20] and 1.33.0-wmf.1, new, should be on all wikis by Thursday, but in case of trouble, never .) [11:24:33] go to https://tools.wmflabs.org/versions/ [11:24:40] there are three boxes [11:24:51] left one is 0, middle is 1, right is 2 [11:24:55] oh I see [11:25:06] click the triangle and it expands the list of wikis [11:25:18] I need it on all Wikipedias, so groups 1 and 2. [11:25:19] so group 0 are small wikis and test wikis [11:25:44] group 1 is some middle ground, group 2 are big wikis, like enwiki [11:26:25] aharoni: so, 460895 is in the new branch? (I guess no action is needed then) [11:26:35] but 469507 is not? [11:26:53] I think it's the other way around. [11:27:23] 469507 got merged this morning, so it's unlikely it is in the deployment branch [11:28:06] 460895 merged a couple of days ago (October 23) so maybe in the deployment branch, checking [11:28:28] zeljkof: both group 1 and group 2 are on 1.32.0-wmf.26 now. [11:28:44] so I made https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469603/ [11:29:06] aharoni: yes, that's the old branch, hopefully going away today, but maybe not [11:29:58] Yeah, and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469507/ was indeed merged today, and it's not part of the branch cut on Tuesday. [11:30:04] Do I need to cherry-pick it? [11:30:41] aharoni: if you want it deployed, I need a commit in one (or both) currently deployed branches [11:30:51] so tldr: probably yes [11:30:52] :) [11:32:29] zeljkof: drat. it depends on the other patch, so I'm afraid I cannot cherry-pick it until the first one goes through CI and is merged. [11:32:45] or is there a way to rebase it somehow? [11:32:53] aharoni: in the same repo? you can chain commits, right? [11:33:09] uh, I don't think I've chained cherry picks before [11:34:15] zeljkof: another question: if the train runs tonight, will group 1 and group 2 be switched to 1.33.0-wmf.1 ? 
[11:35:07] if current problems are resolved, and if there are no problems while promoting the new branch to groups 1 and 2, then yes [11:35:28] but in practice, hard to tell, since unexpected problems are, well, unexpected :) [11:35:57] zeljkof: OK... so then the cherry-picks I'm doing now probably have to be done for 1.33.0-wmf.1, too? [11:36:34] aharoni: yes, if the commit(s) are not already in the branch, they have to be cherry picked and deployed in a swat window [11:36:47] OK [11:36:51] there are two more swat windows today (during US working hours) [11:36:51] godog, ema: Hi, please can you put me in contact with a Debian FreeNode Group Contact? [11:36:53] 10Operations: ferm fail to start at boot in some cases - https://phabricator.wikimedia.org/T207417 (10jijiki) [11:36:55] 10Operations, 10Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race on jessie - https://phabricator.wikimedia.org/T148986 (10jijiki) [11:37:07] so probably not a good time for us, but somebody in the US can be around for SWAT [11:37:24] or if it's urgent, somebody might stay around for a late deploy :) [11:37:49] zeljkof: OK, I think I figured everything out [11:37:55] aharoni: I guess you're not in Portland, since you're awake? [11:38:16] zeljkof: no. I was invited actually, but I'm too busy with my family ;) [11:38:35] another baby coming up, if all goes well [11:38:57] oh, didn't know, congratulations! 
:) [11:39:04] thanks ;) [11:39:45] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T207868 (10jijiki) p:05Triage>03High a:03Cmjohnson [11:40:06] zeljkof: So: I made https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469605/ and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469603/ [11:40:12] and I'm waiting for CI [11:42:38] aharoni: ok, so, it's unlikely that those commits will be merged and deployed in this swat window [11:42:47] :( [11:43:19] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T207868 (10jijiki) [11:43:25] looks like CI will need 10-20 minutes, and then another 10-20 when merging, and there are 15 minutes left... [11:44:05] there are two more windows today, 18:00–19:00 and 23:00–00:00 UTC [11:44:13] is any of them good for you or somebody from your team? [11:45:05] if this is causing an outage or a serious problem, I can always extend the swat window, but if it can wait until the next window, that would be better [11:46:03] or, I can +2 the commits now, before the test pipeline jobs are done, if a job fails, the commits will not get merged anyway [11:46:23] aharoni: do you need a lot of time to test the commits at mwdebug1002? [11:46:32] no, super-short [11:46:49] 10Operations, 10Patch-For-Review: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10jijiki) p:05Triage>03Normal [11:47:11] zeljkof: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469603/ is ready [11:47:29] aharoni: does deploying one commit help, or do we need both? or more? [11:48:04] zeljkof: can you merge it https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469603/ perhaps? If you merge it, it won't be deployed, right? 
[11:48:37] aharoni: if I merge it, I should deploy it, but yes, it does not get deployed automatically [11:48:46] yeah, so you can do it. [11:48:49] it will be helpful [11:49:06] ok, in that case I'll merge and deploy it [11:49:31] aharoni: can you please update the calendar with the commits that will be deployed today, so we have a clean record? [11:49:36] OK [11:52:34] zeljkof: done [11:52:43] aharoni: thanks! [11:53:39] aharoni: it's very unlikely 469605 will get deployed now [11:53:55] we'll probably have to extend the window just to deploy 469603 [11:54:35] aaaand one job failed :/ https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/22203/ [11:55:06] argh [11:55:11] `npm ERR! shasum check failed for /tmp/npm-2330-b9caa7b6/registry.npmjs.org/core-js/-/core-js-2.5.7.tgz` [11:55:20] just what we needed CI trouble [11:55:46] aharoni: ok, so with 5 minutes left in the window, I suggest that we give up for this window :( [11:55:59] :( [11:56:09] I don't think we'll be able to deploy anything without extending the window for 30 minutes or more [11:56:30] that's doable, if this is really urgent, but only if so [11:56:37] so, how urgent is this? :) [11:57:35] zeljkof: not super-urgent, but what is this failure?! [11:57:53] looks like CI trouble with caching npm packages :/ [11:58:40] it will probably not happen if I rerun the jobs, but that is slowing everything for another 10-20 minutes :( [11:58:46] sigh [11:58:58] le sigh [11:59:17] so, giving up? can you reschedule for later today? [11:59:24] zeljkof: is it possible to rerun it, to at least get it merged? [11:59:35] or does it also have to be deployed if it's merged? [11:59:39] aharoni: ah, I can't leave stuff merged [11:59:44] I have to deploy [11:59:56] yes, I have to deploy merged stuff [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1200) [12:00:20] sigh. [12:00:23] I'll reschedule. 
[12:00:27] thanks for the help! [12:01:01] aharoni: ok, sorry for not being able to deploy, we are working on CI, making it faster and more robust, but it takes time... [12:01:26] aharoni: please update the calendar, I'll remove my +2 from the patch [12:01:59] !log EU SWAT finished [12:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] zeljkof: I'll update [12:02:49] thanks! [12:04:15] Krenair: hi, not sure exactly what group you're referring to, #debian-ops perhaps ? [12:04:37] godog, the debian group on freenode [12:04:42] with debian/* cloaks [12:05:06] I found a page about an IRC council with an email address that sounds like what I want, will check #debian-ops too [12:05:27] Krenair: ack, yeah I think #debian-ops might be able to help [12:05:36] cool thanks godog [12:05:47] dcausse: looks like the backlog flushed and today's indices are created \o/ thanks for your help [12:10:10] cool! [12:10:25] [head's up] cumin1001 is about to be rebooted in few minutes [12:11:03] zeljkof: "Pupper SWAT" is not appropriate for this, right? [12:11:46] aharoni: no, it's for deploying things in operations/puppet, as far as I know [12:12:36] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10fgiunchedi) Looks like logs in deployment-prep are back now (cc... 
[12:14:33] !log rebooting cumin1001 to pick new kernel and clear any potential weird state after OOMs [12:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:50] cumin1001 back online and at your service [12:21:52] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10dcausse) a:05dcausse>03Krenair I overlooked other instances... [12:43:51] 10Operations, 10Patch-For-Review: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10jijiki) a:03akosiaris [12:46:38] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10faidon) >>! In T207536#4694093, @aborrero wrote: > Ok, I think I understand this better now. > > But we still have... [12:48:04] 10Operations, 10Security-Team, 10Wikimedia-Site-requests, 10Patch-For-Review: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10fgiunchedi) >>! In T207900#4693299, @Bawolff wrote: >>>! In T207900#4693255, @faidon wrote: >> Cool! Cc'ing @herron and @fgiunchedi here for aw... [12:48:41] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Language-Team (Language-2018-October-December), and 2 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) p:05Triage>0... 
[12:59:14] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1300) [13:01:41] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) 05Open>03Resolved Completed! ``` 14:00 :: Whois for: icinga-wm (~icinga-wm@wikimedia/bot/icinga-wm) ``` [13:01:47] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) [13:03:13] volans: related to the rabbithole in ^ T205522 can be resolved, what do you think? [13:03:13] T205522: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 [13:04:00] godog: I guess so, also we didn't had a full repro and we're moving to icinga1001 that might or might not have the same issue [13:04:06] jessie vs stretch [13:04:58] indeed, and the code at least won't swallow exceptions now, I'll resolve it [13:05:40] 10Operations, 10IRCecho, 10Patch-For-Review, 10User-fgiunchedi: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving as the code will log exceptions now and we haven't seen further crashes. [13:07:11] thanks for the cleanup [13:07:15] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10faidon) [13:07:18] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10faidon) 05Resolved>03Open I see emails for SG3 that (as far as I can tell) haven't made it to maint-announce, e.g. ``` Date: Thu, 25 Oct 2018 1... 
[13:09:51] np, I had set a reminder to check icinga-wm's cloack [13:09:54] cloak even [13:12:39] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, a further improvement might be to check for dpkg-dist conffiles left in case there are config changes to do" [puppet] - 10https://gerrit.wikimedia.org/r/469439 (owner: 10Ema) [13:15:34] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/469439 (owner: 10Ema) [13:18:06] (03PS1) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/469611 (https://phabricator.wikimedia.org/T207834) [13:22:06] (03CR) 10Ottomata: "Ahh sorry missed that, thanks elukey." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469563 (owner: 10Elukey) [13:23:13] (03CR) 10Ottomata: "Great!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469562 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [13:25:37] (03PS1) 10Filippo Giunchedi: hieradata: change burrow port for kafka logging [puppet] - 10https://gerrit.wikimedia.org/r/469612 (https://phabricator.wikimedia.org/T206454) [13:25:39] (03PS1) 10Filippo Giunchedi: prometheus: add Burrow metrics for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/469613 (https://phabricator.wikimedia.org/T206454) [13:26:05] (03PS4) 10Ema: wmf-upgrade-and-reboot: non-interactive Debian frontend [puppet] - 10https://gerrit.wikimedia.org/r/469439 [13:26:08] (03PS3) 10Muehlenhoff: Switch srvdumps rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467978 [13:26:15] (03CR) 10jerkins-bot: [V: 04-1] hieradata: change burrow port for kafka logging [puppet] - 10https://gerrit.wikimedia.org/r/469612 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:27:08] (03CR) 10Ema: [C: 032] wmf-upgrade-and-reboot: non-interactive Debian frontend [puppet] - 10https://gerrit.wikimedia.org/r/469439 (owner: 10Ema) [13:27:20] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 
10Language-Team (Language-2018-October-December), and 2 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) Fix has been me... [13:28:01] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Language-Team (Language-2018-October-December), 10Wikimedia-production-error: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10Nikerabbit) [13:28:27] !log test add term return-tcp permit on cr2-codfw [13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:48] (03PS1) 10Arturo Borrero Gonzalez: toolforge: bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) [13:29:24] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 4 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Ottomata) Thanks @Smalyshev, I think you are write that changes like this should be announced a bit better. We... 
[13:29:44] !log test successful, rollback add term return-tcp permit on cr2-codfw [13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:50] (03CR) 10jerkins-bot: [V: 04-1] toolforge: bootstrap service node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/469614 (https://phabricator.wikimedia.org/T207591) (owner: 10Arturo Borrero Gonzalez) [13:30:18] (03CR) 10Ottomata: [C: 031] hieradata: change burrow port for kafka logging [puppet] - 10https://gerrit.wikimedia.org/r/469612 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:30:48] (03CR) 10Ottomata: [C: 031] prometheus: add Burrow metrics for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/469613 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:31:32] 10Operations, 10Revision-Slider, 10TCB-Team, 10WMDE-Analytics-Engineering, 10Graphite: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 (10fgiunchedi) >>! In T205416#4691176, @Lea_WMDE wrote: > Thanks @fgiunchedi! Just to be sure... 
[13:33:19] (03PS2) 10Filippo Giunchedi: hieradata: change burrow port for kafka logging [puppet] - 10https://gerrit.wikimedia.org/r/469612 (https://phabricator.wikimedia.org/T206454) [13:33:21] (03PS2) 10Filippo Giunchedi: prometheus: add Burrow metrics for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/469613 (https://phabricator.wikimedia.org/T206454) [13:33:44] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: change burrow port for kafka logging [puppet] - 10https://gerrit.wikimedia.org/r/469612 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:33:53] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add Burrow metrics for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/469613 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:34:57] (03PS3) 10Filippo Giunchedi: hieradata: change burrow port for kafka logging [puppet] - 10https://gerrit.wikimedia.org/r/469612 (https://phabricator.wikimedia.org/T206454) [13:35:37] (03CR) 10Filippo Giunchedi: [C: 031] Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 (owner: 10Muehlenhoff) [13:37:03] (03PS3) 10Filippo Giunchedi: prometheus: add Burrow metrics for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/469613 (https://phabricator.wikimedia.org/T206454) [13:37:09] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] prometheus: add Burrow metrics for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/469613 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:39:33] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [13:41:43] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 64, down: 19, shutdown: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:42:12] (03CR) 10Filippo Giunchedi: [C: 04-1] icinga: on stretch, tell rsyslog to discard logs from check_nrpe (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [13:43:23] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational [13:44:06] (03PS5) 10Muehlenhoff: Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 [13:45:07] (03CR) 10ArielGlenn: "It looks like auto_ferm_ipv6 is a top-scoped variable as used in rsync::server::module. And I don't see the ipv6 ferm rules in the catalog" [puppet] - 10https://gerrit.wikimedia.org/r/467978 (owner: 10Muehlenhoff) [13:46:10] (03PS1) 10Anomie: Set CommentTableSchemaMigrationStage => WRITE_NEW on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469617 (https://phabricator.wikimedia.org/T166733) [13:46:35] !log reformat ms-be2043 xfs filesystems - T199198 [13:46:37] (03CR) 10Anomie: [C: 032] "Deploying planned config change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469617 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [13:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:39] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [13:47:01] (03CR) 10Muehlenhoff: [C: 032] Switch prometheus-ops rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467990 (owner: 10Muehlenhoff) [13:47:43] (03Merged) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469617 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [13:48:55] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting comment table migration stage to write-new/read-both on all wikis (T166733) (duration: 00m 55s) [13:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:59] T166733: Deploy refactored comment storage - 
https://phabricator.wikimedia.org/T166733 [13:56:25] (03PS2) 10Muehlenhoff: Disable prometheus rsyncd module for now [puppet] - 10https://gerrit.wikimedia.org/r/467991 [13:57:51] (03CR) 10Muehlenhoff: [C: 032] Disable prometheus rsyncd module for now [puppet] - 10https://gerrit.wikimedia.org/r/467991 (owner: 10Muehlenhoff) [13:58:59] (03PS2) 10Muehlenhoff: Switch carbon rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467958 [14:01:10] (03CR) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469617 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:04:41] (03CR) 10Jcrespo: [C: 031] Fix PTR for db2042 [dns] - 10https://gerrit.wikimedia.org/r/467711 (owner: 10Volans) [14:05:26] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Gehel) >>! In T206636#4690384, @Smalyshev wrote: > @Andrew A... [14:06:58] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10faidon) Good idea! I think upstream [[ https://github.com/NagiosEnterprises/nrpe/commit/fe006d2556c906de84321188630ab... 
[14:13:11] (03PS4) 10Jcrespo: Fix PTR for db2042 [dns] - 10https://gerrit.wikimedia.org/r/467711 (owner: 10Volans) [14:19:23] (03CR) 10Muehlenhoff: [C: 032] Switch carbon rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467958 (owner: 10Muehlenhoff) [14:20:49] !log running dns update (gerrit patch: 467711) [14:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:14] (03PS2) 10Elukey: Add missing AAAA records for druid eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467701 (owner: 10Volans) [14:23:48] (03CR) 10Elukey: [C: 032] Add missing AAAA records for druid eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467701 (owner: 10Volans) [14:27:57] (03PS4) 10Muehlenhoff: Switch srvdumps rsync module to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467978 [14:28:42] !log upgrade druid on druid100[4-6] to Druid 0.12.3 [14:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:35] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10WMDE-leszek) Thanks for the attention. As the engineering manager at WMDE I confirm that person behind the user name "jk" is no longer doing software development/engineering work at WMF infra... [14:29:53] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [14:46:03] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10Addshore) >>! In T207792#4693923, @MoritzMuehlenhoff wrote: >>>! In T207792#4691594, @jijiki wrote: >> @Addshore could you please give us some context on this (e.g. they are not working for W... [14:46:11] morning James_F [14:46:46] Hey. [14:46:57] addshore: Ideas for next step? 
[14:49:10] (03CR) 10Muehlenhoff: "The ferm service name is based on name of the rsyncd service, so they won't clash. If multiple services are created for the rsyncd port, t" [puppet] - 10https://gerrit.wikimedia.org/r/467985 (owner: 10Muehlenhoff) [14:49:12] (03PS1) 10Vgutierrez: certcentral: Avoid fast retry on local errors after cert is issued [software/certcentral] - 10https://gerrit.wikimedia.org/r/469624 (https://phabricator.wikimedia.org/T207927) [14:50:56] 10Operations, 10ops-eqiad, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10Cmjohnson) a:05Cmjohnson>03RobH @robh added label on server, added to switch asw-a-eqiad ge-6/0/18 up up weblog1001 and in private1-a [14:51:32] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10jlinehan) [14:52:21] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Avoid fast retry on local errors after cert is issued [software/certcentral] - 10https://gerrit.wikimedia.org/r/469624 (https://phabricator.wikimedia.org/T207927) (owner: 10Vgutierrez) [14:52:52] (03PS1) 10Addshore: Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 [14:52:58] jouncebot: now [14:52:59] For the next 0 hour(s) and 7 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1300) [14:53:01] James_F: ^^ [14:53:09] lets do that then turn it on again [14:53:10] jouncebot: next [14:53:11] In 1 hour(s) and 6 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1600) [14:53:26] James_F: in the office yet? :P [14:53:36] Just outside. 
[14:53:50] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) >>! In T206636#4690379, @Smalyshev wrote: >> I've cr... [14:53:59] (03PS2) 10Addshore: Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 [14:55:28] Let’s try it. [14:56:11] Will only fix lexeme though? [14:57:01] Yesterday you said there were three entity types still enabled? [14:58:32] addshore: Unless I mis-remember? [14:58:38] the three were all lexeme [14:58:44] Ah, OK. [14:58:45] i thought the fix was harder than this ^^ [14:58:46] Sure, let's do it. [14:58:49] 10Operations, 10ops-eqiad, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10Cmjohnson) [14:59:05] 10Operations, 10ops-eqiad: apply hostname label for weblog1001/WMF4750 - https://phabricator.wikimedia.org/T207764 (10Cmjohnson) 05Open>03Resolved [14:59:08] 10Operations, 10ops-eqiad, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10Cmjohnson) [14:59:13] addshore: You deploying or should I? [14:59:50] 10Operations, 10ops-eqiad, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10Cmjohnson) [15:00:05] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) Router has been unracked and dropped at shipping for ship out. shipping information below {F26792797} [15:01:02] James_F: i can do [15:01:39] Go for it. 
[15:01:54] (03CR) 10Addshore: [C: 032] Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 (owner: 10Addshore) [15:02:04] !log test rsyslog 8.38 upgrade on lithium - T136312 [15:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:09] T136312: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312 [15:02:29] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10Ottomata) @nuria needs to give the sign off from analytics, but from my POV this is all correct! Yeehaw! This will be discussed in Monday's... [15:03:01] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10Nuria) Approved on my end. [15:03:57] (03PS3) 10Addshore: Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 [15:04:02] (03CR) 10Addshore: [C: 032] Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 (owner: 10Addshore) [15:05:02] (03Merged) 10jenkins-bot: Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 (owner: 10Addshore) [15:05:08] (03PS1) 10Muehlenhoff: Convert udp2log::rsyncd to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/469627 [15:06:44] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 26552 MB (5% inode=99%) [15:06:52] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson I had one remaining 4TB spare disks on-site. 
Replaced the disk, cleared the cache and all disks are back [15:07:17] syncing [15:07:49] (03PS2) 10Addshore: Revert "logging: Disable 'Wikibase.NewItemIdFormatter' channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 [15:07:52] (03CR) 10Addshore: [C: 032] Revert "logging: Disable 'Wikibase.NewItemIdFormatter' channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 (owner: 10Addshore) [15:07:59] (03PS2) 10Vgutierrez: certcentral: Avoid fast retry on local errors after cert is issued [software/certcentral] - 10https://gerrit.wikimedia.org/r/469624 (https://phabricator.wikimedia.org/T207927) [15:08:05] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Explicitly set wgLexemeEnableRepo for wikidatas [[gerrit:469625]] (duration: 00m 55s) [15:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:29] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/13204/" [puppet] - 10https://gerrit.wikimedia.org/r/469627 (owner: 10Muehlenhoff) [15:08:57] (03Merged) 10jenkins-bot: Revert "logging: Disable 'Wikibase.NewItemIdFormatter' channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 (owner: 10Addshore) [15:10:03] James_F: can you make the patch for turning wikibaserepo back on on beta commons? [15:10:10] Sure. [15:10:33] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [15:10:49] (03PS1) 10Jforrester: Revert "Revert "Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469628 [15:11:07] so many reverts :P [15:11:11] (03PS2) 10Jforrester: [Beta Cluster] Re-enable WBMI on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469628 [15:11:18] (03PS3) 10Addshore: [Beta Cluster] Re-enable WBMI on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469628 (owner: 10Jforrester) [15:11:21] I'm lazy. 
:-) [15:11:23] (03CR) 10Addshore: [C: 032] [Beta Cluster] Re-enable WBMI on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469628 (owner: 10Jforrester) [15:11:35] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "logging: Disable Wikibase.NewItemIdFormatter channel" (duration: 00m 55s) [15:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:04] (03Merged) 10jenkins-bot: [Beta Cluster] Re-enable WBMI on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469628 (owner: 10Jforrester) [15:13:59] (03PS6) 10Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) [15:14:41] 10Operations, 10ops-eqiad: Broken memory on thumbor1004 - https://phabricator.wikimedia.org/T207721 (10Cmjohnson) @MoritzMuehlenhoff I am sure I have one buried in the 300 servers on the floor but the few that are easy to access are only 8GB. [15:16:05] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10Cmjohnson) @BBlack the idrac h/w log does not show any failures [15:16:32] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: [Beta Cluster] Re-enable WBMI on Beta Commons (duration: 00m 54s) [15:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:11] James_F: I guess now we wait for it to appear on beta? [15:17:15] Yeah. [15:17:19] Do-dee-doo. [15:17:45] James_F: mind pinging me once it is live again? :D [15:17:52] * addshore might go grab breakfast / head to the venue [15:18:07] Of course. Coming to the Foundation/Wikidata meeting in 12 minutes' time? 
I assumed it would be cancelled, but… [15:18:19] oh [15:18:21] forgot about [15:18:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Cmjohnson) @Bstorm is this okay to resolve now? [15:18:54] (03CR) 10jenkins-bot: Explicitly set wgLexemeEnableRepo for wikidatas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469625 (owner: 10Addshore) [15:18:56] (03CR) 10jenkins-bot: Revert "logging: Disable 'Wikibase.NewItemIdFormatter' channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 (owner: 10Addshore) [15:18:58] (03CR) 10jenkins-bot: [Beta Cluster] Re-enable WBMI on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469628 (owner: 10Jforrester) [15:19:30] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) 05Open>03Resolved @elukey okay [15:21:04] (03PS1) 10Muehlenhoff: When absenting an rsyncd module, also remove the ferm service [puppet] - 10https://gerrit.wikimedia.org/r/469629 [15:21:30] addshore: Now cancelled. [15:21:36] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Cmjohnson) @elukey dbstore1002 is out of warranty and has 1.2T disks. I don't have disks this size but can replace with a 2TB disk.. [15:26:17] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/13205/" [puppet] - 10https://gerrit.wikimedia.org/r/469629 (owner: 10Muehlenhoff) [15:28:48] (03PS1) 10Muehlenhoff: Disable prometheus rsyncd module for now [puppet] - 10https://gerrit.wikimedia.org/r/469630 [15:30:14] James_F: hooray [15:30:55] addshore: beta-scap-eqiad is running now. [15:31:29] James_F: yay [15:31:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Bstorm) 05Open>03Resolved Looks great to me! 
Sorry this got buried. [15:31:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) > I've build a new VM t206636-3 that should have... [15:31:53] !log depooling wdqs1003 again, it's not catching up like the other hosts [15:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:44] RECOVERY - Disk space on elastic1025 is OK: DISK OK [15:36:11] !log shutdown aqs1006 to replace one broken disk - T206915 [15:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:15] T206915: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 [15:36:31] addshore: And it's live. [15:36:46] yup, we still have issues but I'm gonna just investigate through the day rather than turn it off again [15:36:53] addshore: "Create an item" isn't listed, but a bunch still are. [15:36:54] lexemes are still listed for example... [15:37:24] Shouldn't https://commons.wikimedia.beta.wmflabs.org/wiki/Special:ListDatatypes theoretically say "none"? [15:37:33] no [15:37:42] But… they aren't available? [15:38:37] addshore: Shall we do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/466954 then? [15:38:59] let's not do that one quite yet [15:39:01] but hopefully today [15:39:04] OK. [15:42:12] addshore: So… at first glance, I'd say Special:AvailableBadges, Special:EntitiesWithoutDescription, Special:EntitiesWithoutLabel, Special:GoToLinkedPage, Special:ItemByTitle, Special:ItemDisambiguation, Special:ItemsWithoutSitelinks, Special:MergeItems, Special:RedirectEntity, Special:SetDescription, Special:SetLabel, Special:SetSiteLink, Special:SetAliases, Special:SetLabelDescriptionAliases shouldn't be listed or enabled.
[15:42:42] Special:DispatchStats, Special:ListProperties, Special:EntityData, Special:EntityPage, and Special:MyLanguageFallbackChain are fine, as is Special:ListDatatypes - but should be blank or in some other way show that the data types are known but not allowed? [15:42:59] yup, can you list them on that ticket that I made, then we can keep looking through that [15:43:53] (03CR) 10Andrew Bogott: [C: 032] Remove all references to labs_metal [puppet] - 10https://gerrit.wikimedia.org/r/469532 (owner: 10Faidon Liambotis) [15:44:07] (03CR) 10Andrew Bogott: [C: 032] "The compiler likes this, let's give it a try :)" [puppet] - 10https://gerrit.wikimedia.org/r/469532 (owner: 10Faidon Liambotis) [15:44:15] (03PS2) 10Andrew Bogott: Remove all references to labs_metal [puppet] - 10https://gerrit.wikimedia.org/r/469532 (owner: 10Faidon Liambotis) [15:45:40] Sure. [15:45:51] (03PS1) 10Bstorm: sonofgridengine: refactor roles into wmcs namespace [puppet] - 10https://gerrit.wikimedia.org/r/469633 (https://phabricator.wikimedia.org/T200557) [15:46:23] there was 10 minutes ago a spike on inserts and updates on enwiki [15:46:28] andrewbogott: cool :) lmk if follow-up is needed [15:46:28] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: refactor roles into wmcs namespace [puppet] - 10https://gerrit.wikimedia.org/r/469633 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:47:44] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 4 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) > engineering@? wikitech-l? I think engineering is good, and probably wikitech too since the data c... [15:48:12] invalidateTitles it seems [15:48:24] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/469633 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:50:00] (03PS1) 10Muehlenhoff: Use auto_ferm for profile::analytics::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/469635 [15:50:02] (03PS2) 10Bstorm: sonofgridengine: refactor roles into wmcs namespace [puppet] - 10https://gerrit.wikimedia.org/r/469633 (https://phabricator.wikimedia.org/T200557) [15:50:21] *reads up* [15:53:15] ACKNOWLEDGEMENT - MD RAID on aqs1006 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207958 [15:53:25] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T207958 (10ops-monitoring-bot) [15:56:01] (03CR) 10Smalyshev: wdqs: increase restart interval of wdqs-updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469447 (https://phabricator.wikimedia.org/T207843) (owner: 10Gehel) [15:57:55] (03PS3) 10Bstorm: sonofgridengine: refactor roles into wmcs namespace [puppet] - 10https://gerrit.wikimedia.org/r/469633 (https://phabricator.wikimedia.org/T200557) [15:58:21] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/13207/" [puppet] - 10https://gerrit.wikimedia.org/r/469635 (owner: 10Muehlenhoff) [16:00:05] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:31] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) I have not heard back from HP yet, I pinged them again [16:04:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) [16:04:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) a:05Andrew>03RobH [16:04:55] (03CR) 10Bstorm: [C: 032] sonofgridengine: refactor roles into wmcs namespace [puppet] - 10https://gerrit.wikimedia.org/r/469633 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [16:07:12] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10Jhernandez) Sounds good from my end too 👍 [16:16:21] (03CR) 10Alexandros Kosiaris: [C: 032] Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) (owner: 10MSantos) [16:16:28] (03PS5) 10Alexandros Kosiaris: Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) (owner: 10MSantos) [16:16:38] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) (owner: 10MSantos) [16:18:23] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28020 MB (5% inode=99%) [16:19:05] ACKNOWLEDGEMENT - MD RAID on aqs1006 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207964 [16:19:09] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T207964 (10ops-monitoring-bot) [16:20:44] 10Operations, 10ops-eqiad: Degraded 
RAID on aqs1006 - https://phabricator.wikimedia.org/T207964 (10elukey) [16:20:47] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10elukey) [16:22:13] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10Cmjohnson) I sent HP a diagnostic log showing disk 5 as failed {F26794607} {F26794615} [16:24:49] !log installed patched nagios-nrpe-plugin and nagios-nrpe-server on icinga1001 - T207775 [16:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:54] T207775: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 [16:25:26] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, but adding faidon as well who created that stanza" [puppet] - 10https://gerrit.wikimedia.org/r/469524 (https://phabricator.wikimedia.org/T207887) (owner: 10Alex Monk) [16:25:42] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) [16:26:26] (03CR) 10Smalyshev: [C: 031] wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/469611 (https://phabricator.wikimedia.org/T207834) (owner: 10Gehel) [16:28:23] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:30] this is me ^ [16:29:55] alongside the maps boxes [16:30:00] who will alert soon [16:31:34] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:31:52] you can see the future :-P [16:32:17] 4, 8, 15, 16, 23, 42 [16:32:20] you know what to do [16:32:23] (03PS1) 10Alexandros Kosiaris: osm::planet_sync: Specify correct cron parameter [puppet] - 10https://gerrit.wikimedia.org/r/469644 [16:32:26] lol [16:33:15] 10Operations, 10Wikimedia-Mailing-lists: New list request for 1lib1ref - https://phabricator.wikimedia.org/T207283 (10AVasanth_WMF) [16:34:19] !log decreasing relative weight of wdqs1003 in LVS to ease the updater [16:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:23] !log gehel@puppetmaster1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=wdqs,name=wdqs1004.codfw.wmnet [16:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:28] !log gehel@puppetmaster1001 conftool action : set/weight=20; selector: dc=eqiad,cluster=wdqs,name=wdqs1005.codfw.wmnet [16:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:21] (03CR) 10Alexandros Kosiaris: [C: 032] osm::planet_sync: Specify correct cron parameter [puppet] - 10https://gerrit.wikimedia.org/r/469644 (owner: 10Alexandros Kosiaris) [16:35:34] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:38:34] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:40:43] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:40:59] 10Operations, 10Icinga, 10fundraising-tech-ops: Why doesn't icinga notify the team-fr-tech contact for services in WARNING state? 
- https://phabricator.wikimedia.org/T207966 (10Jgreen) [16:41:53] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:43:42] (03PS1) 10Gehel: wdqs: switch wdqs1003 and wdqs1006 from public vs internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/469649 (https://phabricator.wikimedia.org/T207947) [16:48:07] (03PS2) 10Gehel: wdqs: switch wdqs1003 and wdqs1006 from public vs internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/469649 (https://phabricator.wikimedia.org/T207947) [16:51:37] (03CR) 10Smalyshev: [C: 031] wdqs: switch wdqs1003 and wdqs1006 from public vs internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/469649 (https://phabricator.wikimedia.org/T207947) (owner: 10Gehel) [16:55:14] RECOVERY - Disk space on elastic1025 is OK: DISK OK [16:56:51] (03PS16) 10Dzahn: Planet: Redesign UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 (https://phabricator.wikimedia.org/T207243) (owner: 10Paladox) [16:59:02] James_F: :( its hard..... [16:59:07] anyway, time for sessions... [16:59:48] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (10Cmjohnson) [16:59:51] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (10Cmjohnson) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1700). [17:00:05] shdubsh: by install did you mean upgrade? [17:02:06] volans: maybe that is more accurate. 
it was the same version of nrpe, but with additional patches [17:02:11] (03CR) 10Dzahn: [C: 032] "replaces Bootstrap with bulma.io, removes jQuery completely" [puppet] - 10https://gerrit.wikimedia.org/r/467100 (https://phabricator.wikimedia.org/T207243) (owner: 10Paladox) [17:02:38] shdubsh: ack, no prob :) [17:02:46] (03Abandoned) 10Alex Monk: [WIP] dnsrecursor: Rewrite code setting up lua hooks [puppet] - 10https://gerrit.wikimedia.org/r/304146 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [17:07:45] addshore: :-( [17:10:51] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) hey hey heyyy, the nodes are in! https://phabricator.wikimedia.org/T204177#4695147 How can we move this forwa... [17:17:46] (03PS1) 10Cmjohnson: Adding dns entries for an-worker10[78-96] [dns] - 10https://gerrit.wikimedia.org/r/469656 (https://phabricator.wikimedia.org/T207192) [17:17:59] (03PS2) 10Cmjohnson: Adding dns entries for an-worker10[78-96] [dns] - 10https://gerrit.wikimedia.org/r/469656 (https://phabricator.wikimedia.org/T207192) [17:19:17] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (10EBjune) [17:20:14] !log planet - regenerating feeds for 'en' and 'de', others will follow by cron. switching to new theme. replaced bootstrap with bulma. removed jQuery. 
thanks to paladox [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:18] (03PS1) 10Cmjohnson: Merge branch 'master' of https://gerrit.wikimedia.org/r/p/operations/dns into mydnschanges [dns] - 10https://gerrit.wikimedia.org/r/469657 [17:20:32] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@95452cf]: Update mobileapps to 58cbdff (T206527) [17:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:39] T206527: [BUG] Citations not being parsed correctly - https://phabricator.wikimedia.org/T206527 [17:21:27] (03Abandoned) 10Cmjohnson: Merge branch 'master' of https://gerrit.wikimedia.org/r/p/operations/dns into mydnschanges [dns] - 10https://gerrit.wikimedia.org/r/469657 (owner: 10Cmjohnson) [17:22:28] (03Abandoned) 10Cmjohnson: Adding dns entries for an-worker10[78-96] [dns] - 10https://gerrit.wikimedia.org/r/469656 (https://phabricator.wikimedia.org/T207192) (owner: 10Cmjohnson) [17:22:36] (03CR) 10Cwhite: "Per T207775, it looks like consensus points to patching check_nrpe making this changeset unnecessary." 
[puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [17:24:00] (03PS1) 10Alexandros Kosiaris: mathoid: Add various informational chart values [deployment-charts] - 10https://gerrit.wikimedia.org/r/469658 [17:24:02] (03PS1) 10Alexandros Kosiaris: Invert logic for specifying externalIPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/469659 [17:24:04] (03PS1) 10Alexandros Kosiaris: scaffold: Invert the externalIPs inclusion logic [deployment-charts] - 10https://gerrit.wikimedia.org/r/469660 [17:24:06] (03PS1) 10Alexandros Kosiaris: Add chartid to pod labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/469661 [17:24:08] (03PS1) 10Alexandros Kosiaris: WIP: Support canary functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/469662 [17:24:22] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@95452cf]: Update mobileapps to 58cbdff (T206527) (duration: 03m 50s) [17:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:25:42] (03PS1) 10Bstorm: sonofgridengine: clean up the old roles after moving under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/469663 [17:26:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:26:53] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:27:43] PROBLEM - Esams 
HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:29:42] (03CR) 10Bstorm: [C: 032] sonofgridengine: clean up the old roles after moving under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/469663 (owner: 10Bstorm) [17:34:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:35:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:36:26] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/includes/changetags/ChangeTags.php: 08f8e6a9d7f1dcb281321c5e3a3471169e68348d (duration: 00m 55s) [17:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:47] (03PS1) 10Cmjohnson: Adding dns entries an-worker10[78-96] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) [17:43:01] (03CR) 10jerkins-bot: [V: 04-1] Adding dns entries an-worker10[78-96] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) (owner: 10Cmjohnson) [17:46:57] (03PS2) 10Cmjohnson: Adding dns entries an-worker10[78-96] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) [17:47:57] (03PS2) 10Cmjohnson: Fix records for camera [dns] - 10https://gerrit.wikimedia.org/r/467709 (owner: 10Volans) [17:48:33] PROBLEM - Device not healthy -SMART- on aqs1006 is CRITICAL: cluster=aqs device=sde 
instance=aqs1006:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1006&var-datasource=eqiad%2520prometheus%252Fops [17:50:27] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/tests/phpunit/includes/page/WikiPageDbTestBase.php: f3b5a1df116f426c2809f2a266b9d761f15c349f (duration: 00m 55s) [17:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:59] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4967dba]: Test deploy new update & scripts [17:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:27] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4967dba]: Test deploy new update & scripts (duration: 00m 28s) [17:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:33] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/includes/page/WikiPage.php: f3b5a1df116f426c2809f2a266b9d761f15c349f (duration: 00m 54s) [17:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:55] (03PS1) 10Smalyshev: Re-enable kafka on test [puppet] - 10https://gerrit.wikimedia.org/r/469666 [17:56:24] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 6.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [17:58:57] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.1/extensions/Translate/tag: c5fa239917a870240ec4dcd8a617f0f8033aa9bf (duration: 00m 55s) [17:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1800). 
[18:00:04] stephanebisson and aharoni: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:34] Hi [18:00:54] Shalom from Jerusalem [18:02:55] who's deploying today? [18:03:33] aharoni: Hi! I'll deploy [18:03:39] aharoni: enjoy the lifts on saturday [18:03:45] :D [18:04:27] (lifts?) [18:04:39] it is said that due to sabbath [18:04:48] lifts stop on all floors [18:04:58] on saturdays:) [18:05:30] In some buildings, not in mine, thankfully :) [18:05:35] [ https://en.wikipedia.org/wiki/Shabbat_elevator ] [18:06:27] ah yes that one :D [18:07:02] (03PS3) 10Cmjohnson: Adding dns entries an-worker10[78-95] [dns] - 10https://gerrit.wikimedia.org/r/469664 (https://phabricator.wikimedia.org/T207192) [18:11:24] stephanebisson: so, I have three patches, all of them backports. The one that you've already merged is for wmf/1.33.0-wmf.1 , and it's definitely needed. [18:11:53] aharoni: I see one of them is failing jenkins [18:12:03] The other two are perhaps less important, but only if it's certain that the train will run later today and everything will be deployed to all the wikis. [18:12:16] stephanebisson: hmm, it's not supposed to. [18:12:23] zeljkof earlier today said that it's supposed to pass. [18:12:44] I've requested a recheck, we'll see [18:13:00] The 3rd one is the same as the first one, but for wmf.26, right? 
[18:13:07] stephanebisson: yes [18:13:16] it depends on the second one [18:13:26] wmf.1 is not on group1 yet, the train is delayed [18:13:34] (03PS1) 10Ottomata: Copy hive-site.xml to HDFS from a normal hive client, not coordinator node [puppet] - 10https://gerrit.wikimedia.org/r/469668 [18:13:37] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10Cmjohnson) [18:13:58] stephanebisson: aha, so it's good to deploy all of them then. let's hope jenkins doesn't give us any more troubles. [18:14:04] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Cmjohnson) [18:15:05] (03CR) 10Ottomata: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/compiler1002/13210/" [puppet] - 10https://gerrit.wikimedia.org/r/469668 (owner: 10Ottomata) [18:15:14] joal fyi ^ [18:16:32] Many thanks for that ottomata [18:17:17] (03CR) 10Smalyshev: [C: 04-1] "pending manual testing" [puppet] - 10https://gerrit.wikimedia.org/r/469666 (owner: 10Smalyshev) [18:26:09] stephanebisson: looks like jenkins is better now [18:28:44] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:30:04] SMalyshev: ^ that's your test of kafka poller ? [18:34:39] aharoni: Your change (469605) is on mwdebug1001, can you test? [18:35:01] stephanebisson: ack [18:35:49] stephanebisson: it's all good, please proceed [18:36:42] deploying... 
[18:37:26] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.1/extensions/ContentTranslation/: SWAT: [[gerrit:469605|Remove the session parameter from AbuseFilter logging]] (duration: 00m 56s) [18:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:56] stephanebisson: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469603/ passed jenkins [18:38:18] stephanebisson: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469608/ will also need +2 [18:38:25] aharoni: yep, it's next. Actually, should the other 2 be deployed together? [18:38:34] stephanebisson: yes [18:41:14] (03CR) 10Smalyshev: [C: 031] "Seems to be working ok" [puppet] - 10https://gerrit.wikimedia.org/r/469666 (owner: 10Smalyshev) [18:42:18] stephanebisson: bleh, Jenkins complains about https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469608/ [18:43:10] aharoni: What's the error, I don't see anything [18:43:58] stephanebisson: https://integration.wikimedia.org/ci/job/mwgate-npm-node-6-docker/53225/console [18:44:07] probably recheck will fix it [18:54:28] stephanebisson: wow, slow Jenkins [18:54:43] https://gerrit.wikimedia.org/r/#/c/469603/ is close to being merged [18:55:08] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469608/ will probably need a recheck after 469603 is merged [18:55:53] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10faidon) So, this is quite the can of worms :) There are several pieces to this, and honestly, I feel like VLANs is kind o... [18:58:42] stephanebisson: does https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/469608/ need a recheck? 
One test there is red on https://integration.wikimedia.org/zuul/ [18:59:08] aharoni: I don't know if we can or should recheck before the job is finished [19:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T1900). [19:01:45] stephanebisson and aharoni: finish deploying SWAT, I'll wait [19:02:14] stephanebisson: it's done [19:02:22] I don't know if recheck works while it's still running the tests. It should really abort when one test fails IMO [19:02:27] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) FYI, networking considerations being worked out in {T207321} [19:03:20] twentyafterfour: Thanks, we have a patch to re+2 (I just did). Jenkins is super slow. I don't know how long it's going to be :( [19:03:49] stephanebisson: yeah I've been following along. It's ok. [19:08:45] stephanebisson: https://integration.wikimedia.org/zuul/ red again :( [19:09:16] aharoni: Yeah, not good. I think we should abort mission. [19:09:32] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): wdqs updater should be better isolated from blazegraph and common workload should be shared between servers - https://phabricator.wikimedia.org/T207837 (10Gehel) There are 3 issues here, and maybe they should be addresse... [19:10:07] aharoni: Is it common that CX patches need recheck? [19:10:16] stephanebisson: no :/ [19:10:35] aharoni: how severe is the problem we're trying to fix with those patches? [19:11:12] not something truly urgent. the first one we merged was the really important one. [19:11:17] (merged and deployed) [19:11:19] twentyafterfour: what are the odds that group2 will be on wmf.1 today or this week? [19:11:28] is the train running now for all the groups?
[19:11:56] stephanebisson: It all depends on whether all the patches really fix the blockers or if an issue still remains. [19:12:21] I'm not sure yet, I am afraid that T207881 might still remain [19:12:22] T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 [19:12:50] aharoni: so the second patch was merged but not deployed. Should we revert it or deploy it now? [19:12:52] aharoni: I will run group1 and if everything looks good then we'll go to group2 [19:13:07] stephanebisson: deploy please [19:13:11] ok [19:13:24] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [19:14:41] aharoni: it's on mwdebug1001, can you test? [19:15:30] (03PS1) 10Gehel: wdqs-test: switch to kafka poller [puppet] - 10https://gerrit.wikimedia.org/r/469676 [19:16:16] stephanebisson: ack, looking, will be very quick [19:18:14] (03PS2) 10Smalyshev: Re-enable kafka on test [puppet] - 10https://gerrit.wikimedia.org/r/469666 [19:19:35] (03CR) 10Gehel: [C: 032] wdqs-test: switch to kafka poller [puppet] - 10https://gerrit.wikimedia.org/r/469676 (owner: 10Gehel) [19:19:45] stephanebisson: all good [19:19:54] deploying... [19:20:54] !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/ContentTranslation/: SWAT: [[gerrit:469603|Add detailed logging for AbuseFilter]] (duration: 00m 56s) [19:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:59] doe [19:21:02] done [19:21:09] And that concludes SWAT [19:21:17] (03PS3) 10Smalyshev: Re-enable kafka on test [puppet] - 10https://gerrit.wikimedia.org/r/469666 [19:21:29] twentyafterfour: ^ thanks for your patience [19:21:46] stephanebisson: You're welcome. [19:21:59] Thanks for swatting. [19:22:31] (03CR) 10Gehel: [C: 032] Re-enable kafka on test [puppet] - 10https://gerrit.wikimedia.org/r/469666 (owner: 10Smalyshev) [19:23:19] !log beginning mediawiki train.
Will start with group1 and then monitor the situation for a few minutes. If everything looks good then we go to group2. [19:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:00] !log twentyafterfour@deploy1001 Started scap: full sync to be sure that 1.33.0-wmf.1 is fully deployed [19:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:08] (03PS1) 10Gehel: wdqs: switch to kafka updater for wdqs internal and public clusters [puppet] - 10https://gerrit.wikimedia.org/r/469679 [19:27:54] (03CR) 10Gehel: [C: 032] wdqs: switch to kafka updater for wdqs internal and public clusters [puppet] - 10https://gerrit.wikimedia.org/r/469679 (owner: 10Gehel) [19:29:44] So I've never seen this before: BUG: Bad page map in process hhvm [19:31:10] looks like they are all from mw1272 [19:31:47] mutante: you !logged this error in 2016, any idea what was the cause or the solution? Should we reboot mw1272? [19:33:02] (03PS2) 1020after4: Add .gitreview [software/keyholder] - 10https://gerrit.wikimedia.org/r/460698 (owner: 10Hashar) [19:33:16] (03CR) 1020after4: [C: 032] Add .gitreview [software/keyholder] - 10https://gerrit.wikimedia.org/r/460698 (owner: 10Hashar) [19:33:29] twentyafterfour: no, i don't. yes, we can reboot it [19:33:54] (03Merged) 10jenkins-bot: Add .gitreview [software/keyholder] - 10https://gerrit.wikimedia.org/r/460698 (owner: 10Hashar) [19:34:00] i see you made a ticket already, ack [19:34:03] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 643 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:34:54] yeah that was before I noticed it was all one server. Probably good to document the problem anyway [19:36:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10kostajh) @MoritzMuehlenhoff I'm sorry about that. 
Here's the new public key: ``` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHZqJOizHso9Yld... [19:43:25] PROBLEM - Apache HTTP on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:24] RECOVERY - Apache HTTP on mw2142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.137 second response time [19:44:40] hmm [19:45:03] !log mw1272 - depooled [19:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:45:42] ^ that memcached alert has been showing up occasionally [19:46:14] it seems like error rate spikes periodically [19:47:02] !log mw1272 - depooled, restarting hhvm (T207983) [19:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:07] T207983: BUG: Bad page map in process hhvm - https://phabricator.wikimedia.org/T207983 [19:47:33] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:50:02] uh, error rates just shot up again and I didn't even go to group1 yet [19:51:10] !log mw1272 - rebooting (a stop job is running for HHVM PH/Hack runtime) (T207983) [19:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:36] (03PS4) 10Sbisson: Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 [19:54:22] !log mw1272 - repooled (T207983) [19:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:26] T207983: BUG: Bad page map in process hhvm - https://phabricator.wikimedia.org/T207983 [19:55:03] PROBLEM - Apache HTTP on mw1270 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 
Service Unavailable - 1308 bytes in 0.001 second response time [19:55:03] PROBLEM - HHVM rendering on mw1270 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [19:56:04] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time [19:56:04] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 74206 bytes in 0.103 second response time [19:56:14] (03PS1) 10Gehel: wdqs: remove wdqs1006 from internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/469685 [19:56:16] (03PS1) 10Gehel: wdqs: add wdqs1006 to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/469686 [19:56:18] (03PS1) 10Gehel: wdqs: remove wdqs1003 from public cluster [puppet] - 10https://gerrit.wikimedia.org/r/469687 [19:56:20] (03PS1) 10Gehel: wdqs: add wdqs1003 to internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/469688 [19:56:45] twentyafterfour: since reboot i have not seen the "BUG" line anymore yet [19:57:03] mutante: yeah it was probably cosmic rays [19:57:43] (03CR) 10Sbisson: [C: 032] "Per Roan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 (owner: 10Sbisson) [19:58:01] I'm more worried about the recurring alerts for 503 and the high rate of sql lock wait timeouts [19:58:26] T207881 [19:58:27] T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 [19:58:43] this is still happenening even with wmf.1 at group0 [19:59:07] (03Merged) 10jenkins-bot: Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 (owner: 10Sbisson) [20:02:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:02:58] !log 
twentyafterfour@deploy1001 Finished scap: full sync to be sure that 1.33.0-wmf.1 is fully deployed (duration: 36m 57s) [20:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:43] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:04:01] uhh [20:04:47] !log still haven't deployed wmf.1 yet error rate increased and icinga is alerting about mediawiki exceptions + wdqs1010 degraded [20:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:54] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [20:06:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:08:41] Hi, I have problems [20:08:53] zoran@zoran-notebook:~/development/mediawiki$ git fetch [20:08:53] Terminated [20:08:53] zoran@zoran-notebook:~/development/mediawiki$ git fetch && git pull [20:08:53] packet_write_wait: Connection to 208.80.154.85 port 29418: Broken pipe [20:09:02] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Language-Team (Language-2018-October-December), and 3 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10mmodell) Trizek-WMF: Can yo... [20:10:15] What I should do? 
[20:10:20] I again got Broken pipe error [20:10:24] 10Operations, 10docker-pkg, 10Patch-For-Review: Allow selecting which images to build - https://phabricator.wikimedia.org/T186416 (10hashar) 05Open>03Resolved a:03Joe Can now be done by using `--select` [20:10:34] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10ayounsi) Should the next step here to make an exhaustive list of the "support services" indicating: server, applicat... [20:12:55] 10Operations, 10Security-Team, 10Wikimedia-Site-requests, 10Patch-For-Review: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Tgr) > Based on mw.org, it seems very roughly like a wiki the size of mw.org gets about 30 hits/minute on average, with ocassional spikes to 15... [20:14:27] (03CR) 10jenkins-bot: Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 (owner: 10Sbisson) [20:17:36] 10Operations, 10Security-Team, 10Wikimedia-Site-requests, 10Patch-For-Review: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Bawolff) > I'd say lets go ahead and do group0 + group1 and see where we're at. Also we have icinga alerts for logstash dropping packets now, i... [20:18:40] Zoranzoki21: I'm not sure [20:19:30] Zoranzoki21: it works for me [20:20:11] twentyafterfour: I tryed with another computer (different IP and internet provider) with same settings.. 
Same happening [20:20:23] weird [20:20:35] twentyafterfour: Let's talk on releng [20:20:36] twentyafterfour seems to be a internal error [20:21:19] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10Dzahn) The patched source package has been imported into apt.wikimedia.org ``` [install1002:~] $ sudo -i reprepro l... [20:25:05] Oct 25 20:21:33 cobalt java[18243]: log4j:WARN Detected problem with connection: java.net.SocketException: Broken pipe (Write failed) [20:25:15] that's not very helpful [20:25:24] but that's the only java log entry I see [20:25:47] oh [20:26:20] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469709 [20:26:23] (03CR) 1020after4: [C: 032] group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469709 (owner: 1020after4) [20:26:25] (03CR) 10Dzahn: "since this is elukey's suggestion i think he'd also be the best reviewer here" [puppet] - 10https://gerrit.wikimedia.org/r/468865 (https://phabricator.wikimedia.org/T184261) (owner: 10GTirloni) [20:28:05] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469709 (owner: 1020after4) [20:30:07] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.1 refs T206655 [20:30:12] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469709 (owner: 1020after4) [20:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:25] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655 [20:31:02] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.1 refs T206655 (duration: 00m 54s) [20:31:17] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:24] !log db error rate increased again. rolling back [20:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:08] (03Abandoned) 10Dzahn: icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [20:34:42] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469711 [20:34:46] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469711 (owner: 1020after4) [20:35:41] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10Mholloway) Tilerator should be resilient to attempting to access a locked DB resource. I'd rather see us handle thi... 
[20:35:52] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469711 (owner: 1020after4) [20:39:00] (03Abandoned) 10Dzahn: icinga/etcd: /var/run/icinga/ -> /var/run/nagios/ [puppet] - 10https://gerrit.wikimedia.org/r/467017 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:48:01] !log staying at group1, error rate seems to have stabilized [20:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:54] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469769 [20:48:56] (03CR) 1020after4: [C: 032] group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469769 (owner: 1020after4) [20:50:52] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469769 (owner: 1020after4) [20:53:49] (03PS1) 10Ayounsi: DNS: assign public /29 for cloud-instance-transport1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/469771 (https://phabricator.wikimedia.org/T207663) [20:54:10] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469711 (owner: 1020after4) [20:54:12] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469769 (owner: 1020after4) [20:57:44] anomie: are we in the middling of backfilling comments? I see some old revisions have a comment_id of 0 but rev_comment is non-empty [20:59:00] (03CR) 10Smalyshev: [C: 031] wdqs: remove wdqs1006 from internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/469685 (owner: 10Gehel) [20:59:06] (03CR) 10Smalyshev: [C: 031] wdqs: add wdqs1006 to public cluster [puppet] - 10https://gerrit.wikimedia.org/r/469686 (owner: 10Gehel) [20:59:31] musikanimal: We haven't started backfilling comments yet. 
That's the next step, which is possibly blocked on T189158. [20:59:32] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158 [20:59:40] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-extensions-Translate, 10Language-Team (Language-2018-October-December), and 3 others: Moving or deleting a translatable page on mediawiki.org triggers an error message - https://phabricator.wikimedia.org/T207930 (10cscott) Verified that {rETR... [21:01:04] (03PS1) 10Faidon Liambotis: nfs-exportd: switch (back) from socket to ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/469772 [21:01:04] (03PS1) 10Faidon Liambotis: nfs-exportd: remove unused parameters from Project [puppet] - 10https://gerrit.wikimedia.org/r/469773 [21:01:08] (03PS1) 10Faidon Liambotis: nfs-exportd: remove the Project class [puppet] - 10https://gerrit.wikimedia.org/r/469774 [21:01:32] rats. Was kind of hoping it'd write to both rev_comment and comment_text until backfilling is done :/ [21:07:02] (03CR) 10Dzahn: "> the unit is can be found at /run/systemd/generator.late/icinga.service. Not sure where it comes from though" [puppet] - 10https://gerrit.wikimedia.org/r/462600 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:07:39] is there any script on prod machines that allows to set downtime easily without digging through icinga in browser? 
[21:10:48] 10Operations, 10Analytics, 10EventBus, 10Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10mobrovac) [21:13:24] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10MGChecker) [21:16:32] (03PS1) 1020after4: group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469779 [21:16:34] (03CR) 1020after4: [C: 032] group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469779 (owner: 1020after4) [21:17:15] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Johan) Written, sent to Seddon to make sure the CentralNotice part of it makes sense. [21:19:55] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) >>! In T204047#4695804, @Mholloway wrote: > Tilerator should be resilient to attempting to access a locked... [21:20:09] (03Merged) 10jenkins-bot: group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469779 (owner: 1020after4) [21:21:04] (03PS1) 10Cwhite: icinga: install nsca_frack.cfg in objects on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469780 (https://phabricator.wikimedia.org/T202782) [21:21:23] 10Operations, 10Cloud-Services, 10netops, 10Patch-For-Review: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) Thanks for investigating it! See https://gerrit.wikimedia.org/r/c/operations/dns/+/469771 for the IPs, I took the same model as the... 
[21:25:00] (03CR) 10Bstorm: [C: 04-1] "Overall, doing this causes an unneeded rewrite of two functions as well." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469774 (owner: 10Faidon Liambotis) [21:25:03] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.33.0-wmf.1 refs T206655 [21:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:07] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655 [21:25:26] (03CR) 10Ayounsi: [C: 032] DNS: assign public /29 for cloud-instance-transport1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/469771 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [21:28:19] (03PS1) 1020after4: group2 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469781 [21:28:23] (03CR) 1020after4: [C: 032] group2 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469781 (owner: 1020after4) [21:29:45] !log configure 208.80.153.185/29 on cr1/2-codfw - T207663 [21:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:49] T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 [21:30:42] (03CR) 10jenkins-bot: group2 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469779 (owner: 1020after4) [21:30:56] (03Merged) 10jenkins-bot: group2 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469781 (owner: 1020after4) [21:31:10] (03CR) 10jenkins-bot: group2 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469781 (owner: 1020after4) [21:35:45] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: rolling back group1 refs T206655 T208000 [21:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:35:49] T208000: Parser.php: Call to a member function get() on a non-object (null) - https://phabricator.wikimedia.org/T208000 [21:35:50] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655 [21:38:32] (03PS1) 10Bstorm: sonofgridengine: prepare web exec profiles for grid [puppet] - 10https://gerrit.wikimedia.org/r/469783 (https://phabricator.wikimedia.org/T200557) [21:44:52] (03CR) 10Thcipriani: [C: 032] Switch to Construct for the SSH agent protocol (032 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [21:45:33] (03CR) 10Thcipriani: Switch to Construct for the SSH agent protocol [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [21:45:35] (03PS4) 10Thcipriani: Switch to Construct for the SSH agent protocol [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [21:45:55] (03CR) 10Thcipriani: [C: 032] "Let's try that again." [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [21:46:51] (03Merged) 10jenkins-bot: Switch to Construct for the SSH agent protocol [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [21:47:16] (03CR) 10Cwhite: [C: 031] "Looks like metrics are embedded within pybal itself." [puppet] - 10https://gerrit.wikimedia.org/r/469593 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [21:47:39] (03PS3) 10Thcipriani: Split handle_client_request() into multiple methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/458234 (owner: 10Faidon Liambotis) [21:47:41] (03CR) 10Cwhite: [C: 031] Remove Diamond from LVSes [puppet] - 10https://gerrit.wikimedia.org/r/469594 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [21:48:42] (03CR) 10Bstorm: [C: 031] "Oh very nice! I missed that when I upgraded it to python3 again." 
[puppet] - 10https://gerrit.wikimedia.org/r/469772 (owner: 10Faidon Liambotis) [21:50:13] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10Mholloway) If we're comfortable letting tile generation fail during the period populate_admin() runs (I'm not sure w... [21:51:18] (03CR) 10Thcipriani: [C: 032] Split handle_client_request() into multiple methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/458234 (owner: 10Faidon Liambotis) [21:51:58] (03Merged) 10jenkins-bot: Split handle_client_request() into multiple methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/458234 (owner: 10Faidon Liambotis) [21:52:16] (03PS3) 10Thcipriani: Stop referring to the daemon as a "proxy" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458235 (owner: 10Faidon Liambotis) [21:52:22] (03CR) 10Faidon Liambotis: "I'm not familiar enough to test or validate, so I'd prefer it you +2ed/merged instead :)" [puppet] - 10https://gerrit.wikimedia.org/r/469772 (owner: 10Faidon Liambotis) [21:53:02] (03CR) 10Bstorm: [C: 031] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/469772 (owner: 10Faidon Liambotis) [21:56:44] (03CR) 10Thcipriani: [C: 032] "Patch works. Description should also be updated in setup.py at some point." 
(031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458235 (owner: 10Faidon Liambotis) [21:57:19] (03CR) 10Bstorm: [C: 032] nfs-exportd: switch (back) from socket to ipaddress [puppet] - 10https://gerrit.wikimedia.org/r/469772 (owner: 10Faidon Liambotis) [21:57:25] (03CR) 10Faidon Liambotis: "Yeah, I noticed that and was actually wondering whether I should tag this as RFC :)" [puppet] - 10https://gerrit.wikimedia.org/r/469774 (owner: 10Faidon Liambotis) [21:57:32] (03Merged) 10jenkins-bot: Stop referring to the daemon as a "proxy" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458235 (owner: 10Faidon Liambotis) [21:57:35] (03Abandoned) 10Faidon Liambotis: nfs-exportd: remove the Project class [puppet] - 10https://gerrit.wikimedia.org/r/469774 (owner: 10Faidon Liambotis) [21:59:18] (03PS3) 10Thcipriani: Implement all the SSH agent bits and stop proxying [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [21:59:26] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) >>! In T204047#4696074, @Mholloway wrote: > If we're comfortable letting tile generation fail during the pe... [22:03:19] (03PS2) 10Bstorm: sonofgridengine: prepare web exec profiles for grid [puppet] - 10https://gerrit.wikimedia.org/r/469783 (https://phabricator.wikimedia.org/T200557) [22:05:09] (03CR) 10Bstorm: [C: 032] sonofgridengine: prepare web exec profiles for grid [puppet] - 10https://gerrit.wikimedia.org/r/469783 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:13:35] (03CR) 10Bstorm: "It looks like something was intended there but never was used... Will look at this some more." 
[puppet] - 10https://gerrit.wikimedia.org/r/469773 (owner: 10Faidon Liambotis) [22:25:34] (03PS1) 10Bstorm: sonofgridengine: Add new roles for stretch grid web nodes [puppet] - 10https://gerrit.wikimedia.org/r/469790 (https://phabricator.wikimedia.org/T200557) [22:27:43] (03PS1) 10Mobrovac: service::node: Set config-vars.yaml's mode to 0440 [puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) [22:30:06] (03CR) 10Dzahn: [C: 032] "thanks! https://puppet-compiler.wmflabs.org/compiler1002/13211/" [puppet] - 10https://gerrit.wikimedia.org/r/469780 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [22:30:49] (03PS2) 10Dzahn: icinga: install nsca_frack.cfg in objects on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469780 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [22:32:24] (03PS2) 10Mobrovac: service::node: Set config-vars.yaml's mode to 0440 [puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) [22:34:21] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler1002/13213/" [puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [22:41:59] (03CR) 10Dzahn: [C: 032] "we have 2852 hosts on both jessie and stretch now :)" [puppet] - 10https://gerrit.wikimedia.org/r/469780 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [22:52:57] (03PS4) 10Dzahn: icinga: on stretch, use systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/462600 (https://phabricator.wikimedia.org/T202782) [22:53:51] (03CR) 10jerkins-bot: [V: 04-1] icinga: on stretch, use systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/462600 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [22:54:01] jouncebot: now [22:54:01] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [22:54:04] jouncebot: 
Nemo_bis [22:54:06] :/ [22:54:09] jouncebot: next [22:54:10] In 0 hour(s) and 5 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T2300) [22:59:32] twentyafterfour: What code for CN is actually in prod? [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181025T2300). [23:00:04] AndyRussG and MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:29] (03CR) 10Cwhite: [C: 031] "Based on the experience with NSCA, it's likely that systemd is generating the unit on installation. We should manage it." [puppet] - 10https://gerrit.wikimedia.org/r/462600 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:00:44] James_F: what do you mean? what version of CN? [23:00:53] Yeah. [23:01:09] I'm reading the code off deployment.eqiad.wmnet now. [23:01:25] It doesn't match master. [23:01:34] Here. I can deploy [23:01:50] AndyRussG: yt? [23:01:55] James_F: it's the head of the wmf_deploy [23:01:57] MaxSem: yep! [23:02:02] AndyRussG: Eurgh. [23:02:11] James_F: well put ;p [23:02:14] AndyRussG: This deprecated code was fixed four months ago in I5205ec0d96cb06087624f2cf8d83b8ae2256df0e. [23:02:16] there's a task for that [23:02:20] AndyRussG: This is Not Helpful™. [23:02:21] yes [23:02:25] yes [23:02:39] apologies [23:02:47] James_F: uh [23:02:52] No commit to deploy [23:03:00] MaxSem: hmm? [23:03:09] AndyRussG: Should I cherry-pick the fix for the UBN to the wmf-deploy branch? 
[23:03:11] CentralNotice running super old version is nothing new [23:03:21] James_F: I did [23:03:25] Never mind [23:03:45] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/469794/ [23:03:48] Ta. [23:03:50] Lemme see what the -1 is [23:03:55] I did test it locally [23:04:17] bawolff_: For some values of "super". [23:04:32] MaxSem: James_F: the -1 on the Gerrit change is just a flapping QUnit test [23:04:37] nothing changed there [23:06:02] twentyafterfour: Sorry about this. :-( [23:06:17] James_F: no apologies necessary [23:06:19] (03CR) 10Thcipriani: [C: 04-1] "Couple of minor problems in the keyholder bash script" (032 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [23:06:59] James_F: twentyafterfour all the apologies are mine 8p [23:07:16] Thanks for making that task! I hadn't seen it... [23:07:31] Why was this deprecated and removed so quickly anyways? [23:08:50] I screwed up. [23:08:57] bawolff_: Over six months? [23:09:14] According to the task it was soft-deprecated in 1.31 [23:09:40] Yes. [23:09:48] usually there's at least one release of soft-deprecation, and one release of hard-deprecated, at least for a super commonly called method [23:09:51] Soft in 1.31, hard in 1.32, removed in 1.33. [23:10:04] We're in 1.33 now. [23:10:07] Oh right, we're 1.33 [23:10:12] sorry, forgot [23:10:29] That's a much more reasonable deprecation path [23:10:31] But I should have remembered that codesearch lies about things by assuming master is used. 
:-) [23:12:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] service::node: Set config-vars.yaml's mode to 0440 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [23:13:00] (03PS1) 10Addshore: Define and specify lexeme NS for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469796 [23:13:07] (03CR) 10Thcipriani: [C: 031] Add permission checks for various commands [software/keyholder] - 10https://gerrit.wikimedia.org/r/458240 (owner: 10Faidon Liambotis) [23:13:31] (03PS2) 10Addshore: Define and specify lexeme NS for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469796 [23:13:49] \o could someone give me a heads up once swat is done? :) [23:14:03] James_F: we can still try to get it turned on today :) [23:14:10] addshore: OK… [23:15:35] After you are done that, there is a thingy I want to deploy as well [23:15:56] is swat happening? [23:15:58] bawolff_: James_F all definitely my fault for not pushing out these accumulated CN changes sooner, and also for not getting to fixing the CN deploy setup [23:16:01] bawolff_: You first, we're deploying stuff to Beta Cluster so need to wait. [23:16:11] James_F: also, https://phabricator.wikimedia.org/T207683, new property should also not be listed etc right? [23:16:30] James_F: I'm not ready yet, and I still have to do some things first, you should definitely go first [23:16:30] addshore: I think MaxSem is deploying? [23:16:35] AndyRussG: ack [23:16:46] (03CR) 10WMDE-leszek: [C: 031] "looks legit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469796 (owner: 10Addshore) [23:16:57] Deploying is such a shiny word for "waiting for Zuul" [23:17:10] hehehe [23:17:15] *agrees* [23:17:20] well, jenkins ;) [23:17:47] AndyRussG: Totally understandable. 
It's just that each team has a special thing it does that makes every other team have difficulties (FR has slow releases for security/stability; Web has webpack pre-built files; Growth has templates(!); Perf has deploy-on-Sunday-night; Lang has support-for-two-years; etc.) [23:18:12] Each team makes a local maxima choice, for good reasons, it just disrupts the rest of us. :-( [23:19:39] Fire, fire! Kill it all with fire! :P [23:20:36] (03CR) 10Thcipriani: [C: 04-1] "> 2. Might want to check permissions before handling remove(_all) or" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [23:22:30] James_F: yep... mmm one sec, lemme find that Fab dask [23:22:32] task [23:23:39] James_F: here's at least one (maybe there's more): https://phabricator.wikimedia.org/T113428 [23:26:11] (03CR) 10Jforrester: [C: 031] Define and specify lexeme NS for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469796 (owner: 10Addshore) [23:26:12] !log maxsem@deploy1001 Synchronized php-1.33.0-wmf.1/extensions/GlobalPreferences/: https://gerrit.wikimedia.org/r/c/469793/ (duration: 00m 58s) [23:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:19] addshore: Also, yes, we don't want CreateProperty on Commons right now, but it's OK to have it listed (as long as it's inoperable), as we will want it quite soon. [23:27:47] James_F: "as long as it's inoperable", yeh, for that reason I think it is safer to just get rid of it for now :) [23:27:51] its less scary that way [23:28:01] addshore: Focussing on the actually-wrong things in T207683 [23:28:03] T207683: Wikibase Repo api modules and special pages should be conditionally loaded based on entity types enabled? - https://phabricator.wikimedia.org/T207683 [23:28:57] AndyRussG: pulled on mwdebug1002 [23:29:42] MaxSem: ok checking [23:32:02] MaxSem: internal server error [23:32:13] Also the request took a long time [23:32:31] Are you testing og wmf.26? 
[23:32:36] s/og/on/ [23:32:39] MaxSem: oh wait now it worked [23:32:41] UBN bug fix https://gerrit.wikimedia.org/r/c/mediawiki/core/+/469798 [23:32:50] not sure why the site is up with this bug active [23:33:10] interesting [23:33:13] MaxSem: no, wmf.1, which is what's on Meta, where the bug occurs. https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners [23:33:19] MaxSem: it just worked now [23:33:25] maybe it is one for MaxSem? [23:33:52] TimStarling: because I rolled back [23:34:07] twentyafterfour: it's still on the other sites though right? you didn't roll back all of the groups? [23:34:27] MaxSem: I don't have any way to test for wmf.26, but I thought it was sane to synchronize versions [23:34:40] addshore: I did not roll back all of the groups, I rolled back to group1 because the error rate was low at group1 [23:34:50] maybe the request rate is low? [23:34:53] AndyRussG: wmf.26 is enwiki [23:35:25] TimStarling: right [23:35:41] AndyRussG: so... Are we ready to deploy? [23:35:48] MaxSem: yeah... just it's I think an issue for future deploys prior to the next train to have CN not synced to all versions [23:36:00] MaxSem: Did you see what the server error was? [23:36:07] maybe mwdebug1002 just timed out because overloaded?
[23:36:08] (PS1) Brian Wolff: Enable CSP-report-only for logged in/session having users on enwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/469800 (https://phabricator.wikimedia.org/T207900) [23:36:14] I didn't see the error in logstash [23:36:32] dunno if mwdebug servers' logs go there [23:36:32] Your error is Call to undefined method LanguageEn::truncate() in /srv/mediawiki/php-1.33.0-wmf.1/extensions/CentralNotice/special/SpecialCentralNotice.php on line 1543 [23:36:42] (PS3) Mobrovac: service::node: Set config-vars.yaml's mode to 0440 [puppet] - https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) [23:36:59] TimStarling: 10 errors in 15 minutes is pretty low [23:37:00] There are a bunch of timeouts, so a particular one is pretty hard to identify [23:37:02] MaxSem: on mwdebug1002? [23:37:10] Maybe it was just that wmf.1 wasn't synced yet [23:37:27] That's the same error that we're fixing for [23:37:34] I did a full scap earlier today so everything should be sync'd up [23:37:35] MaxSem: looks fine now, I'd say deploy [23:37:38] Hmm, try now?
[23:37:44] (PS4) Faidon Liambotis: Implement all the SSH agent bits and stop proxying [software/keyholder] - https://gerrit.wikimedia.org/r/458236 [23:37:46] (PS3) Faidon Liambotis: Split SshAgentCommand type to Request/Response [software/keyholder] - https://gerrit.wikimedia.org/r/458237 [23:37:48] (PS3) Faidon Liambotis: Make pylint a little happier [software/keyholder] - https://gerrit.wikimedia.org/r/458238 [23:37:50] (PS3) Faidon Liambotis: Use mlockall() to avoid any potential swapping [software/keyholder] - https://gerrit.wikimedia.org/r/458239 [23:37:53] (PS3) Faidon Liambotis: Add permission checks for various commands [software/keyholder] - https://gerrit.wikimedia.org/r/458240 [23:37:54] MaxSem: yep still all good [23:37:54] (PS3) Faidon Liambotis: Verify the validity of signature requests [software/keyholder] - https://gerrit.wikimedia.org/r/458241 [23:37:57] (PS3) Faidon Liambotis: Implement SSH_AGENTC_LOCK/SSH_AGENTC_UNLOCK [software/keyholder] - https://gerrit.wikimedia.org/r/458242 [23:37:59] (PS3) Faidon Liambotis: Parse/build agent request/responses once [software/keyholder] - https://gerrit.wikimedia.org/r/458243 [23:38:01] (PS3) Faidon Liambotis: Refactor handle() [software/keyholder] - https://gerrit.wikimedia.org/r/458244 [23:38:02] (PS3) Faidon Liambotis: Add compatibility with Construct 2.8.22 and 2.9.45 [software/keyholder] - https://gerrit.wikimedia.org/r/458245 [23:38:05] (PS3) Faidon Liambotis: Switch path handling to pathlib.Path [software/keyholder] - https://gerrit.wikimedia.org/r/458246 [23:38:06] (PS3) Faidon Liambotis: Unlink the Unix domain socket when exiting [software/keyholder] - https://gerrit.wikimedia.org/r/458247 [23:38:08] (PS3) Faidon Liambotis: Abstract the SSH fingerprint generation [software/keyholder] - https://gerrit.wikimedia.org/r/458248 [23:38:11] (PS3) Faidon Liambotis: Stop spawning ssh-keygen but generate fps ourselves
[software/keyholder] - https://gerrit.wikimedia.org/r/458249 [23:38:24] pls go ahead and deploy anytime [23:39:23] !log maxsem@deploy1001 Synchronized php-1.33.0-wmf.1/extensions/CentralNotice/: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/469794/ (duration: 00m 57s) [23:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:29] (CR) Faidon Liambotis: "1. Really good point. Hadn't thought of that!" (2 comments) [software/keyholder] - https://gerrit.wikimedia.org/r/458236 (owner: Faidon Liambotis) [23:39:32] AndyRussG: ^ [23:40:14] MaxSem: yep all good! [23:40:18] Thanks so much!!!! :) [23:40:32] Whee [23:40:36] (CR) Mobrovac: service::node: Set config-vars.yaml's mode to 0440 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: Mobrovac) [23:40:46] heh indeed [23:41:02] (CR) jerkins-bot: [V: -1] Abstract the SSH fingerprint generation [software/keyholder] - https://gerrit.wikimedia.org/r/458248 (owner: Faidon Liambotis) [23:41:53] is that all for swat? I need to deploy Tim's patch asap [23:42:57] AndyRussG: No need to push I5205ec0d96cb06087624f2cf8d83b8ae2256df0e to 1.32.0-wmf.26, right? [23:43:15] The function called was only dropped in 1.33.0-wmf.1. [23:43:23] twentyafterfour: Clear, go for it. [23:43:27] (PS4) Mobrovac: service::node: Set config-vars.yaml's mode to 0440 [puppet] - https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) [23:44:27] well crud, still hasn't merged so I guess I was a little ahead of myself there [23:44:32] it's almost through CI [23:44:33] :-D [23:45:24] James_F: As regards actual site functionality goes, that's correct, no need to. As regards deployment procedure...
I've heard it's a pain that patches merged on the CN wmf_deploy branch don't get deployed to all live versions [23:45:41] because the submodule just points to the head of wmf_deploy [23:45:50] Oh, right. Let me look. [23:46:01] so if someone comes along to deploy something else to wmf.26 then they're confused 'cause there's undeployed stuff [23:46:03] something like that [23:46:38] submodules always just point to a detached head [23:46:38] (PS5) Mobrovac: service::node: Set config-vars.yaml's mode to 0440 [puppet] - https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) [23:46:56] unless you explicitly check out the branch within the submodule [23:46:58] Yeah, `diff php-1.33.0-wmf.1/extensions/CentralNotice/special/SpecialCentralNotice.php php-1.32.0-wmf.26/extensions/CentralNotice/special/SpecialCentralNotice.php` is not-empty. [23:47:16] twentyafterfour: Want to fix? [23:47:18] so, techconf is wrapping up, I will need to get https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/469796/ out of the door once the UBN things are deployed [23:47:28] Sorry addshore . [23:47:32] Too much excitement. [23:47:39] James_F: fix which? [23:47:44] as beta wikidata will be broken until I do :), James_F yup, that's fine, I can come back in a little bit [23:47:55] twentyafterfour: Fix the submodule for wmf.26. [23:48:08] (PS2) Alexandros Kosiaris: Add chart to pod labels [deployment-charts] - https://gerrit.wikimedia.org/r/469661 [23:48:10] (PS2) Alexandros Kosiaris: Support canary functionality [deployment-charts] - https://gerrit.wikimedia.org/r/469662 [23:48:22] James_F: I'm not sure what it's supposed to point to?
[23:48:30] (PS5) Faidon Liambotis: Implement all the SSH agent bits and stop proxying [software/keyholder] - https://gerrit.wikimedia.org/r/458236 [23:48:34] (CR) Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/compiler1002/13217/" [puppet] - https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: Mobrovac) [23:48:40] should it be the same as 1.33.0-wmf.1? [23:48:41] (CR) Dzahn: [C: 1] "https://puppet-compiler.wmflabs.org/compiler1002/13215/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/467978 (owner: Muehlenhoff) [23:48:52] It should, yes, and it isn't. [23:48:56] I'll fix, one mo. [23:48:58] ok I can fix that [23:49:00] oh [23:49:44] AndyRussG: wmf.26 live on mwdebug1002. [23:49:49] AndyRussG: Does it need further testing? [23:50:04] James_F: heh no, we can't actually test this change there [23:50:12] I mean, other than checking that the site isn't down [23:50:16] Oh, right, it's only used on Meta. [23:50:20] yeah [23:50:41] Site is indeed not down via mwdebug1002. [23:50:44] OK, I'll sync. [23:51:14] James_F: yeah sounds great! thanks!!! [23:52:28] Operations, monitoring, Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (Dzahn) It has been fixed by @colewhite and by adding the binaries like so: ``` 17:05 < moritzm> reprepro -C main in... [23:53:16] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/CentralNotice/special/SpecialCentralNotice.php: SWAT Sync versions of SpecialCentralNotice to avoid dirty repo checkout T208004 (duration: 00m 56s) [23:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:21] T208004: Call to undefined method LanguageEn::truncate() - https://phabricator.wikimedia.org/T208004 [23:53:28] twentyafterfour: Conch is yours. Just in time for jenkins to merge 469799.
:-) [23:53:44] phpunit is at 58% [23:53:54] * James_F sighs. [23:53:57] Operations, monitoring, Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (Dzahn) Mostly resolved but we might want to keep it open for the "suggest to upstream maintainer" part too. [23:54:07] Operations, monitoring, Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (Dzahn) p:Triage>Normal [23:54:08] Can we paint some go-faster stripes on the RAM sticks or something? [23:54:21] * twentyafterfour thinks we need more jenkins slaves [23:54:30] some go-faster stripes would be nice as well [23:54:38] Dedicated ones that sit around waiting for deployment patches would be nice. [23:54:51] "Wasted" but fast when we need it. [23:55:30] like these? https://en.wikipedia.org/wiki/Heat_spreader [23:55:31] Or ultra-fast nodes that dumped their in-progress tests when a higher-priority task came along. [23:55:40] But that'd be… a messy hand-off. [23:57:17] and merged [23:57:28] !log deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/469799/ [23:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:29] !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.1/includes/parser/Parser.php: deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/469799/ refs T208000 (duration: 00m 56s) [23:59:57] ok done. addshore did you have something to deploy as well?