[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171205T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:19:27] (03PS1) 10Dzahn: ulsfo: lvs, bastion, remove ganglia and aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395156 [00:20:11] (03CR) 10jerkins-bot: [V: 04-1] ulsfo: lvs, bastion, remove ganglia and aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395156 (owner: 10Dzahn) [00:20:32] (03PS2) 10Dzahn: ulsfo: lvs, bastion, remove ganglia and aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395156 [00:20:59] (03PS3) 10Dzahn: ulsfo: lvs, bastion, remove ganglia and aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395156 (https://phabricator.wikimedia.org/T177225) [00:21:04] (03CR) 10jerkins-bot: [V: 04-1] ulsfo: lvs, bastion, remove ganglia and aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395156 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:22:22] what, still [00:22:35] no, we just dont get the positive news :) [00:23:03] (03CR) 10Dzahn: [C: 032] ulsfo: lvs, bastion, remove ganglia and aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395156 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [00:28:54] (03CR) 10Krinkle: [C: 04-1] webperf1001/2001 start using webperf role [puppet] - 10https://gerrit.wikimedia.org/r/392030 (https://phabricator.wikimedia.org/T179036) (owner: 10Dzahn) [00:29:14] PROBLEM - DPKG on bast4001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:29:45] !log bast4001 - removing ganglia aggregators, package, config... 
[00:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:19] RECOVERY - DPKG on bast4001 is OK: All packages OK [00:41:59] 10Operations, 10Services (doing), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3704693 (10Eevans) [00:47:05] (03PS1) 10Dzahn: snapshot,prometheus,maintenance,otrs,archive: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395167 (https://phabricator.wikimedia.org/T177225) [00:55:22] 10Operations, 10Services (doing), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3811294 (10Eevans) a:03Eevans I've taken a stab at this, but since I'm not 100% certain what the intention was, procedure-wise, I've pushed it to https://github.com/eevan... [01:55:00] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: / (root with wrong query param) timed out before a response was received: /_info/home (redirect to the home page) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /_info/version (retrieve service version) timed out before a response was received: /{domain}/v1/feed/trending-edits{/period} (retrieve t [01:55:00] thin the last hour) timed out before a response was received [01:56:59] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [02:15:30] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_props: Cant find record in page_props, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001583, end_log_pos 976316916 [02:27:49] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 881.34 seconds [02:29:43] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.10) (duration: 
06m 36s) [02:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:22] (03PS1) 10Ayounsi: DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 [03:06:35] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 (owner: 10Ayounsi) [03:11:00] (03PS2) 10Ayounsi: DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 [03:11:56] (03PS3) 10Ayounsi: DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 [03:13:33] (03CR) 10Ayounsi: [C: 032] DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 (owner: 10Ayounsi) [03:15:26] (03PS4) 10Ayounsi: DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 [03:23:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 783.06 seconds [03:28:38] (03CR) 10Ayounsi: [C: 032] DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 (owner: 10Ayounsi) [03:28:44] (03PS5) 10Ayounsi: DNS: Add eqsin networking [dns] - 10https://gerrit.wikimedia.org/r/395174 [03:40:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [03:46:50] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.28 seconds [03:54:49] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [03:59:10] RECOVERY - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.135 port 9042 [04:51:30] (03PS1) 
10KartikMistry: apertium: Update for new hfst [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/395176 (https://phabricator.wikimedia.org/T181464) [04:58:48] (03CR) 10KartikMistry: [C: 04-1] "Depends on: https://gerrit.wikimedia.org/r/394967" [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/395176 (https://phabricator.wikimedia.org/T181464) (owner: 10KartikMistry) [05:43:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [05:43:39] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [06:10:41] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395181 [06:10:50] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395181 [06:13:11] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395181 (owner: 10Marostegui) [06:14:38] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395181 (owner: 10Marostegui) [06:15:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1103:3312 - T174569 (duration: 00m 44s) [06:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:58] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:16:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/395182 (https://phabricator.wikimedia.org/T174569) [06:17:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395181 (owner: 10Marostegui) [06:17:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395182 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:18:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395182 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:20:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 - T174569 (duration: 00m 43s) [06:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395182 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:20:27] !log Deploy schema change on db1090 (s2) - T174569 [06:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:07] 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3811546 (10Marostegui) [06:28:22] 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171683 (10Marostegui) [06:29:20] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3811555 (10Marostegui) [06:29:22] 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171683 (10Marostegui) 05Open>03Resolved All these servers have been decommissioned or have individual tasks for 
decommissioning as described above, so clo... [06:31:21] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm-dump-debug] [06:32:20] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:32:39] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2064502 [06:32:59] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/dhparam.pem] [06:43:00] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3811557 (10Groovier) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKq1Rl0RHmHoRYDmixvJZSlUYs8BXLDfIEbfUy6FmOwQU0BuAbAFSA8JvJwvTqpb7ptu705U/ZXbtD0SlfK... [06:48:00] !log Fix dbstore1002 replication [06:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:19] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:50:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395183 (https://phabricator.wikimedia.org/T178359) [06:53:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395183 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:54:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395183 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:55:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1034 - T178359! 
(duration: 00m 43s) [06:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:46] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:56:20] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:56:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395183 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:57:20] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:59] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:47] !log Stop MySQL on db1034 to clone db1098:3317 - T178359 [06:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:17] hmm, anyone else having trouble accessing phabricator? I'm getting timeouts... 
[07:02:31] legoktm: works for me [07:03:06] * legoktm pokes his browser harder [07:03:10] working again, that was wird [07:03:12] weird* [07:23:50] !log kartik@tin Started deploy [cxserver/deploy@4b74f03]: Update cxserver to 1693bcf [07:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:12] !log kartik@tin Finished deploy [cxserver/deploy@4b74f03]: Update cxserver to 1693bcf (duration: 03m 22s) [07:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:50] (03PS2) 10Giuseppe Lavagetto: puppetdb: add command-processing threads setting to puppetdb::app [puppet] - 10https://gerrit.wikimedia.org/r/395080 (https://phabricator.wikimedia.org/T179722) (owner: 10Herron) [07:32:52] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetdb: add command-processing threads setting to puppetdb::app [puppet] - 10https://gerrit.wikimedia.org/r/395080 (https://phabricator.wikimedia.org/T179722) (owner: 10Herron) [07:37:49] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 279.00 seconds [07:37:52] (03PS3) 10Thiemo Mättig (WMDE): Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [07:39:17] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "PS3 is a rebase. I also tried to minimize the diff as much as possible, which means I removed a "." 
from the end of an otherwise untouched" [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [07:46:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [07:50:00] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [07:50:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [08:08:32] !log installing python updates on trusty [08:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [08:15:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [08:24:49] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1259.eqiad.wmnet [08:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:15] (03PS1) 10Hashar: contint: a slave script will require 'jq' [puppet] - 
10https://gerrit.wikimedia.org/r/395198 (https://phabricator.wikimedia.org/T181938) [08:25:38] (03CR) 10Hashar: "My fault sorry. I should have run the puppet compiler!" [puppet] - 10https://gerrit.wikimedia.org/r/395118 (owner: 10Dzahn) [08:30:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395199 [08:30:10] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395199 [08:31:06] (03CR) 10Hashar: "Better now: https://puppet-compiler.wmflabs.org/compiler02/9160/" [puppet] - 10https://gerrit.wikimedia.org/r/395198 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [08:52:16] (03PS12) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [08:54:00] !log enabling test production traffic for mw1259 (stretch-based video scaler) [08:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:06] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395199 (owner: 10Marostegui) [08:56:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395199 (owner: 10Marostegui) [08:57:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1090 - T174569 (duration: 00m 44s) [08:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:42] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [08:58:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395473 (https://phabricator.wikimedia.org/T174569) [08:59:32] (03PS13) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 
10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [09:00:10] (03PS1) 10ArielGlenn: fix subdir name in misc dumps cleanup, make script quieter [puppet] - 10https://gerrit.wikimedia.org/r/395474 [09:00:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395473 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:01:34] (03CR) 10ArielGlenn: [C: 032] fix subdir name in misc dumps cleanup, make script quieter [puppet] - 10https://gerrit.wikimedia.org/r/395474 (owner: 10ArielGlenn) [09:01:52] !log Deploy schema change on db1076 (s2) - T174569 [09:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395473 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:03:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 - T174569 (duration: 00m 43s) [09:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:34] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:07:19] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:09:51] (03PS14) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [09:12:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395199 (owner: 10Marostegui) [09:12:27] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395473 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:13:24] RECOVERY - mysqld processes on db1098 is OK: PROCS OK: 2 processes with command name 
mysqld [09:14:11] marostegui: known? ^ [09:14:20] yeah it is a recovery from yesterday [09:14:27] ok :) [09:14:37] :-) [09:17:57] (03CR) 10Elukey: [C: 032] role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [09:19:31] (03PS4) 10Addshore: Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [09:20:32] (03CR) 10ArielGlenn: "Isn't this more privileges for the group on the various hosts, than were approved? E.g. this would give strace * on varnish hosts." [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: 10Dzahn) [09:20:59] !log Optimize s7 on db1098 - T178359 [09:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:09] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [09:25:17] (03PS5) 10Ema: VCL: log TLS information to VSM [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) [09:25:58] (03CR) 10Ema: [C: 032] VCL: log TLS information to VSM [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [09:26:10] (03PS1) 10Marostegui: db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395476 [09:26:24] (03CR) 10Marostegui: [C: 04-2] "Wait for the lag to be gone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395476 (owner: 10Marostegui) [09:26:26] don't suppose anyone feels like merging a small puppet change for a cronjob for wikidata change dispatching? [09:26:34] (03CR) 10Addshore: [C: 031] Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [09:28:03] addshore: linky? 
[09:28:12] ^^ the one i just +1ed :) [09:28:20] ah [09:29:44] let me have a look at the script a bit first [09:30:11] 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-General-or-Unknown, and 3 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3811728 (10Marostegui) @EddieGP any update on this? is this resolved (as per your submitted patch?) [09:31:47] (03PS4) 10Alexandros Kosiaris: Update to 1.7.11 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [09:34:12] apergos: the param being removed from the cron is actually about to be removed from the code :) [09:34:46] (03PS2) 10Filippo Giunchedi: hieradata: enabled restbase1014-b for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/395085 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [09:35:46] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enabled restbase1014-b for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/395085 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [09:39:35] !log bootstrap restbase1014-b - T179422 [09:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:45] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:42:06] 10Operations, 10Ops-Access-Requests: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3811749 (10Gehel) [09:42:31] 10Operations, 10Ops-Access-Requests: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3811762 (10Gehel) [09:42:32] !log reboot analytics100[12] for kernel+jvm updates (Hadoop Master nodes) - T179943 [09:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:42] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [09:42:44] 10Operations, 10Analytics, 10ChangeProp, 
10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3811766 (10mobrovac) [09:42:48] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3811764 (10mobrovac) [09:42:59] (03PS5) 10Alexandros Kosiaris: Update to 1.7.11 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [09:46:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395476 (owner: 10Marostegui) [09:46:36] (03PS5) 10ArielGlenn: Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [09:46:39] addshore: this lgtm, I'll merge and run puppet on terbium then [09:46:48] Thanks! [09:47:25] (03CR) 10ArielGlenn: [C: 032] Wikidata dispatcher: Choose a better value for --randomness [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [09:48:00] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395476 (owner: 10Marostegui) [09:48:08] apergos: I'm tailing the logs and watching stuff :) [09:48:11] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1034 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395476 (owner: 10Marostegui) [09:48:37] (03PS2) 10Muehlenhoff: Install lilypond from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/394963 [09:49:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1034 with low weight - T178359! 
(duration: 00m 43s) [09:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:13] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [09:49:50] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3811770 (10Cparle) Awesome, thanks v much @Dzahn [09:50:16] addshore: puppet run finished [09:50:41] thanks apergos, I'll make sure the next cron fire goes fine and keep an eye on it all day [09:51:22] cron fired just fine! [09:53:34] good! [09:56:01] (03CR) 10Muehlenhoff: [C: 032] Install lilypond from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/394963 (owner: 10Muehlenhoff) [09:57:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [09:57:50] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [09:58:04] it seems cp4021 cron the ints [09:58:24] looking [09:58:32] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&var-server=All [09:58:35] ema: --^ [09:58:51] it was also unhappy before [10:00:00] !log cp4021: restart varnish-be due to mbox lag/fetch failures [10:00:09] (03PS1) 10Muehlenhoff: Revert "Install lilypond from stretch-backports" [puppet] - 10https://gerrit.wikimedia.org/r/395480 [10:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:24] (03PS1) 10Gehel: admin: allow maps-admins 
access to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) [10:00:58] (03CR) 10Muehlenhoff: [C: 032] Revert "Install lilypond from stretch-backports" [puppet] - 10https://gerrit.wikimedia.org/r/395480 (owner: 10Muehlenhoff) [10:01:45] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395482 [10:02:32] (03PS2) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [10:03:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [10:03:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395482 (owner: 10Marostegui) [10:04:46] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395482 (owner: 10Marostegui) [10:05:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1034 - T178359! 
(duration: 00m 43s) [10:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:05] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:07:12] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395482 (owner: 10Marostegui) [10:07:51] !log stopping dbstore1002 (s5) and dbstore2001 (s5) for maintenance [10:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:09] (03PS2) 10Gehel: admin: allow Paul Norman (pnorman) to deploy kartotherian / tilerator [puppet] - 10https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) [10:09:09] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0 [10:09:19] (03CR) 10Muehlenhoff: [WIP] php7 manifests for mediawiki on stretch (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [10:09:50] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=ulsfovar-cache_type=Allvar-status_type=5 [10:10:09] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=uploadvar-status_type=5 [10:10:10] (03CR) 10Muehlenhoff: [C: 031] "Looks good (but needs meeting approval)" [puppet] - 10https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) (owner: 10Gehel) [10:10:26] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3811819 (10Gehel) [10:11:17] (03CR) 10Gehel: [C: 04-1] "waiting for 
approval in weekly Ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) (owner: 10Gehel) [10:12:59] (03CR) 10Alexandros Kosiaris: [C: 031] "Not sure why src and binary packages get different applications, but I guess you got a reason" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394619 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:15:35] (03PS3) 10Addshore: Remove wikidatabuilder LABS ONLY [puppet] - 10https://gerrit.wikimedia.org/r/394291 (https://phabricator.wikimedia.org/T181706) [10:15:41] (03PS4) 10Addshore: Remove wikidatabuilder LABS ONLY [puppet] - 10https://gerrit.wikimedia.org/r/394291 (https://phabricator.wikimedia.org/T181706) [10:15:48] apergos: if you fancy another easy one ^^ [10:15:50] will be a noop [10:16:19] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused Filippo Giunchedi bootstrapping [10:16:26] (03Abandoned) 10Addshore: Stop using extension-list-wikidata from Wikidata build for prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391251 (https://phabricator.wikimedia.org/T177060) (owner: 10Addshore) [10:18:57] (03PS1) 10Muehlenhoff: Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 [10:21:40] 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3811846 (10MarcoAurelio) I must apologize in advance if my impression is not right, but I don't think being combative is helpful here @StevenJ81 (... [10:22:39] addshore: I'd like to make sure there's no lab instances using it, before we take it away. [10:22:47] ack! 
[10:23:19] There really shouldn't be, it's a very limited usecase class [10:27:28] (03PS6) 10Addshore: extension-list extension.json entrypoint for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394283 [10:29:51] (03PS1) 10Addshore: Switch to extension.json for PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 [10:29:53] (03PS1) 10Addshore: Switch to extension.json for WikibaseQuality extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395487 [10:29:55] (03PS1) 10Addshore: Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 [10:30:36] (03CR) 10Addshore: [C: 032] extension-list extension.json entrypoint for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394283 (owner: 10Addshore) [10:31:57] (03Merged) 10jenkins-bot: extension-list extension.json entrypoint for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394283 (owner: 10Addshore) [10:32:37] (03CR) 10jenkins-bot: extension-list extension.json entrypoint for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394283 (owner: 10Addshore) [10:33:06] !log rebooting tegmen for update to 4.9,51 [10:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:18] !log addshore@tin Synchronized wmf-config/extension-list: [[gerrit:394283|extension-list extension.json entrypoint for ArticlePlaceholder]] (duration: 00m 43s) [10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:52] addshore: I've verified that the role and the class it uses aren't used in labs (well, so says https://tools.wmflabs.org/openstack-browser/puppetclass/ and that's good enough) [10:38:57] Thanks apergos ! 
[10:39:35] (03CR) 10ArielGlenn: [C: 032] Remove wikidatabuilder LABS ONLY [puppet] - 10https://gerrit.wikimedia.org/r/394291 (https://phabricator.wikimedia.org/T181706) (owner: 10Addshore) [10:40:52] yw [10:45:53] !log reboot druid1003 for kernel+jvm updates - T179943 [10:45:58] !log rebooting einsteinium (icinga.wikimedia.org) for update to 4.9,51 [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:02] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [10:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:35] (03CR) 10Jonas Kress (WMDE): [C: 031] Remove obsolete WikibaseQualityConstraints settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [10:48:21] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3811940 (10fgiunchedi) [10:48:24] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565#3811936 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Metrics are being collected, I started a dashboard at https://grafana.wikimedia.org/dashboard/db/mail (th... [10:50:31] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [10:51:59] hashar: could you take a look at https://gerrit.wikimedia.org/r/#/c/394551 and its task when you get a chance? thanks! 
[10:52:40] (03PS3) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [10:53:17] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [10:53:30] !log rebooting meitnerium/archiva.wikimedia.org for update to 4.9.51 [10:53:34] (03PS5) 10Ema: mtail: add varnishmtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199) [10:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] (03CR) 10Ema: [V: 032 C: 032] mtail: add varnishmtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [10:54:35] (03PS4) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [10:58:41] (03CR) 10Gehel: "very minor comment inline. Otherwise, lgtm." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394002 (owner: 10EBernhardson) [10:59:16] (03PS2) 10Gehel: Fix deleting old categories [puppet] - 10https://gerrit.wikimedia.org/r/395082 (owner: 10Smalyshev) [10:59:30] (03CR) 10jerkins-bot: [V: 04-1] Fix deleting old categories [puppet] - 10https://gerrit.wikimedia.org/r/395082 (owner: 10Smalyshev) [11:00:42] (03CR) 10Gehel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/395082 (owner: 10Smalyshev) [11:00:59] (03PS1) 10Muehlenhoff: Switch remaining eqiad video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395494 [11:01:01] (03CR) 10jerkins-bot: [V: 04-1] Switch remaining eqiad video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395494 (owner: 10Muehlenhoff) [11:01:03] (03PS2) 10Muehlenhoff: Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 [11:01:07] (03CR) 10Gehel: [C: 032] Fix deleting old categories [puppet] - 10https://gerrit.wikimedia.org/r/395082 
(owner: 10Smalyshev) [11:01:53] (03CR) 10Muehlenhoff: [C: 032] Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 (owner: 10Muehlenhoff) [11:02:06] (03PS3) 10Muehlenhoff: Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 [11:02:10] (03CR) 10jerkins-bot: [V: 04-1] Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 (owner: 10Muehlenhoff) [11:03:17] hashar: CI seems broken due to depleted disk space, I'm getting "Building remotely on integration-slave-docker-1001 (m1executor DebianJessieDocker) in workspace /srv/jenkins-workspace/workspace/operations-puppet-tests-docker FATAL: Unable to produce a script file java.io.IOException: No space left on device [11:03:36] (03CR) 10Muehlenhoff: [V: 032 C: 032] Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 (owner: 10Muehlenhoff) [11:07:41] (03PS1) 10ArielGlenn: remove extraneous cleanups for misc dumps from scripts and crons [puppet] - 10https://gerrit.wikimedia.org/r/395495 [11:08:28] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1925 bytes in 0.121 second response time [11:09:36] addshore: any chance this is that merge? [11:09:53] apergos: right now it could just be regular load [11:10:01] I'm watching it and will ping you if I need a revert :) [11:10:06] ok! 
[11:10:53] (03CR) 10Filippo Giunchedi: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [11:14:38] (03CR) 10Elukey: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [11:23:28] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1956 bytes in 0.130 second response time [11:25:50] (03PS2) 10Giuseppe Lavagetto: Puppet 4 compatibility [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/394981 [11:25:58] akosiaris: Is there any issue with scb1001? I saw lots of 'Worker xx died.. restarting' errors in logstash. [11:26:06] (cxserver) [11:30:28] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1935 bytes in 0.114 second response time [11:35:17] <_joe_> uhm [11:37:55] <_joe_> the statistics page is ok as far as I can tell [11:38:05] <_joe_> not sure what it's searching for btw [11:38:46] <_joe_> ok, there is some lag [11:38:50] <_joe_> 337 s [11:39:11] (03CR) 10Muehlenhoff: [V: 032 C: 032] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/395484 (owner: 10Muehlenhoff) [11:41:23] <_joe_> Amir1: there is some lag in the wikidata dispatch, any suggestions on what could be going wrong? 
[11:41:31] <_joe_> hoo: ping too :P [11:41:53] _joe_: We're already looking into it [11:42:01] <_joe_> hoo: ok cool :P [11:43:10] (03Abandoned) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394557 (owner: 10Muehlenhoff) [11:44:31] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395502 [11:50:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395502 (owner: 10Marostegui) [11:50:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Puppet 4 compatibility [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/394981 (owner: 10Giuseppe Lavagetto) [11:52:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Here's a first round of comments. I haven't yet reviewed templates so I owe that at the very least" (0317 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [11:53:01] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395502 (owner: 10Marostegui) [11:53:14] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395502 (owner: 10Marostegui) [11:53:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395503 [11:53:39] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395503 [11:53:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1034 - T178359! 
(duration: 00m 43s) [11:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:03] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [11:55:28] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1939 bytes in 0.089 second response time [11:57:29] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395503 (owner: 10Marostegui) [11:58:34] !log Upgrade MariaDB and kernel on db1076 [11:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:47] (03PS1) 10Joal: Add cron for banner streaming spark job [puppet] - 10https://gerrit.wikimedia.org/r/395504 (https://phabricator.wikimedia.org/T176983) [11:59:08] elukey: --^ I think this is probably not good as is, I'm looking for comments :) [11:59:25] (03CR) 10jerkins-bot: [V: 04-1] Add cron for banner streaming spark job [puppet] - 10https://gerrit.wikimedia.org/r/395504 (https://phabricator.wikimedia.org/T176983) (owner: 10Joal) [12:01:22] joal: I am going to check after lunch, I'd say to use a profile instead [12:01:58] hm - I'll need more explanations elukey - I'm not good enough in puppet [12:02:02] Talk later elukey [12:02:27] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395505 [12:03:37] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: upgrade to 0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/395507 [12:04:16] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: upgrade to 0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/395507 (owner: 10Giuseppe Lavagetto) [12:06:46] (03CR) 10Hashar: "recheck" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:06:58] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:07:07] volans: debmonitor has some kind of tox job being run by CI now [12:07:33] hashar: thanks! I forgot to add teh option to ignore missing interpreter, I'll add it [12:07:45] but neither 3.5 or 3.6 are there [12:07:47] volans: the container uses Jessie for now though, so there is only py34 available [12:07:52] :( [12:08:03] will want a stretch based one for another python 3 version [12:08:16] that is until one figure out how to integrate multiple pythons [12:08:25] is it possible to have it stretch-based? [12:09:25] I gotta build a new container yes [12:11:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395505 (owner: 10Marostegui) [12:12:19] ok, I'll look if they run fine in 3.4 and add it in the meanwhile [12:12:42] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395505 (owner: 10Marostegui) [12:13:35] volans: and in tox you can add a parameter to skip env when the interpreter is missing. Something like: skip_missing_interpreters = True [12:13:44] yeah I forget that [12:13:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 with low weight (duration: 00m 44s) [12:13:47] said above [12:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:55] *forgot [12:14:07] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395505 (owner: 10Marostegui) [12:15:14] (03PS4) 10Muehlenhoff: Switch codfw video scalers to stretch [puppet] - 10https://gerrit.wikimedia.org/r/395484 [12:16:32] (03PS1) 10Marostegui: db-eqiad.php: Fully pool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395510 [12:17:27] _joe_: ok to puppet-merge your "puppet-compiler: upgrade to 0.4.0" change along? 
[12:17:42] <_joe_> moritzm: yeah sorry [12:17:49] <_joe_> it's a labs-only change and I forgot [12:17:54] k, doing that [12:19:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully pool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395510 (owner: 10Marostegui) [12:20:26] (03Merged) 10jenkins-bot: db-eqiad.php: Fully pool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395510 (owner: 10Marostegui) [12:20:37] (03CR) 10jenkins-bot: db-eqiad.php: Fully pool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395510 (owner: 10Marostegui) [12:21:04] (03PS1) 10Addshore: Wikidata dispatching: randomness to 10 [puppet] - 10https://gerrit.wikimedia.org/r/395512 [12:21:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1034 (duration: 00m 43s) [12:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:28] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1948 bytes in 0.093 second response time [12:28:44] (03PS1) 10Mforns: Add documentation for .m suffix code to pagecounts-ez doc page [puppet] - 10https://gerrit.wikimedia.org/r/395517 (https://phabricator.wikimedia.org/T180452) [12:32:38] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0 [12:39:04] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3812316 (10mmodell) [12:40:45] apergos: if your still around, we are gonna try increasing the randomness a bit https://gerrit.wikimedia.org/r/395512 [12:41:50] I'm here [12:45:54] (03CR) 10Volans: "Thanks a lot for the review, see my replies inline. I'll send the changes when ready." 
(0317 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [12:47:19] 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3398639 (10Joe) Just trying to understand the context - what the original ticket states is we need to redirect from those domains to others. So th... [12:47:21] (03PS1) 10Zfilipin: WIP Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) [12:47:55] (03PS2) 10Zfilipin: WIP Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) [12:48:00] (03CR) 10jerkins-bot: [V: 04-1] WIP Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) (owner: 10Zfilipin) [12:48:26] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3812412 (10Joe) [12:48:35] (03CR) 10jerkins-bot: [V: 04-1] WIP Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) (owner: 10Zfilipin) [12:52:05] ok addshore, I'm going to merge this through, understanding that this moves the script arg closer to the previous default value, so we can see what happens [12:52:33] And If we see no improvement then we will just set it back to the origional 15 [12:52:45] (03PS2) 10ArielGlenn: Wikidata dispatching: randomness to 10 [puppet] - 10https://gerrit.wikimedia.org/r/395512 (owner: 10Addshore) [12:53:52] (03CR) 10ArielGlenn: [C: 032] Wikidata dispatching: randomness to 10 [puppet] - 10https://gerrit.wikimedia.org/r/395512 (owner: 10Addshore) [12:55:14] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We still have tons of RuboCop violations from the 
current version to fix here. I don't think adding even more things to the current TODO l" [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) (owner: 10Zfilipin) [12:55:48] addshore: live on terbium [12:55:58] thanks!] [12:56:02] yw [12:56:17] I'll watch over the next hour or so [12:58:20] ok great [13:14:24] !log reimaging mw2118/mw2119 (video scalers) to stretch [13:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:47] (03PS1) 10Addshore: Wikidata dispatching: randomness to 15 [puppet] - 10https://gerrit.wikimedia.org/r/395525 [13:17:00] apergos: ^^ I think lets just revert it back to 15 and i'll file these tickets [13:22:18] (03CR) 10ArielGlenn: [C: 032] Wikidata dispatching: randomness to 15 [puppet] - 10https://gerrit.wikimedia.org/r/395525 (owner: 10Addshore) [13:23:16] thanks for this apergos, I'll be sure to cc you on the tickets I make [13:24:16] live on terbium [13:24:20] thanks! [13:24:24] yes, please do add me as a subscriber [13:33:11] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395526 [13:36:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395526 (owner: 10Marostegui) [13:36:30] (03PS1) 10Elukey: role::analytics_cluster::hive/oozie: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) [13:38:21] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395526 (owner: 10Marostegui) [13:38:32] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395526 (owner: 10Marostegui) [13:39:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1076 weight (duration: 00m 44s) [13:39:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:05] (03PS2) 10Elukey: role::analytics_cluster::hive/oozie: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) [13:46:45] (03PS3) 10Elukey: role::analytics_cluster::hive/oozie: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) [13:51:04] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395531 [13:51:26] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395531 (owner: 10Muehlenhoff) [13:52:33] (03PS1) 10Muehlenhoff: Add Prometheus exporter for Etherpad [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 [13:54:02] (03PS3) 10Gehel: Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - 10https://gerrit.wikimedia.org/r/394002 (owner: 10EBernhardson) [13:56:48] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3812554 (10MarcoAurelio) >>! In T169450#3812405, @Joe wrote: > Just trying to understand the context - what the original ticket st... [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171205T1400). [14:00:05] subbu and Jhs: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:11] I can swat today [14:00:14] * subbu is around [14:00:29] subbu: do you want to deploy your patch yourself? [14:00:40] or would you prefer if I deploy it? [14:01:02] why don't you? i've never done core deploys before. don't want to start in a swat window :) [14:01:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice!!! Minor comment inline, rest LGTM" (032 comments) [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [14:01:38] subbu: sure, I'll do it, but it's not a core deploy, it's only config ;) [14:01:48] * Jhs_ is also around [14:01:49] I'll let you know when the patch is at mwdebug1002, in a few minutes [14:01:50] right, it is :) [14:03:15] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395009 (https://phabricator.wikimedia.org/T181188) (owner: 10Subramanya Sastry) [14:04:50] (03Merged) 10jenkins-bot: Enable RemexHTML on itwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395009 (https://phabricator.wikimedia.org/T181188) (owner: 10Subramanya Sastry) [14:06:01] subbu: 395009 is at mwdebug1002, please test and let me know if I can deploy [14:06:02] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3812595 (10fgiunchedi) Do you have any numbers wrt how much disk space and requests git-lfs is supposed to take? swift at the moment is "shared" in the sense that we have a si... [14:06:17] will do. 
[14:07:01] (03CR) 10jenkins-bot: Enable RemexHTML on itwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395009 (https://phabricator.wikimedia.org/T181188) (owner: 10Subramanya Sastry) [14:07:15] (03PS2) 10Zfilipin: Add category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [14:07:55] (03CR) 10Zfilipin: [C: 031] Add category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [14:08:23] (03CR) 10Muehlenhoff: Add Prometheus exporter for Etherpad (032 comments) [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [14:08:33] (03PS2) 10Muehlenhoff: Add Prometheus exporter for Etherpad [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 [14:08:41] (03PS7) 10Elukey: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [14:10:22] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395535 [14:10:27] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for Etherpad - https://phabricator.wikimedia.org/T182095#3812602 (10MoritzMuehlenhoff) [14:10:29] ok .. behaviour is as expected on mwdebug1002 on itwiki and dewiki .. [14:10:41] subbu: ok to deploy? [14:10:46] anyone know where i can look at logs to make sure there are no exceptions. _joe_ zeljkof [14:11:06] subbu: I am monitoring logs at mwdlog1001 [14:11:06] (03CR) 10Filippo Giunchedi: Add Prometheus exporter for Etherpad (034 comments) [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [14:11:14] ah, logs look good? 
[14:11:28] subbu: you can log there and run fatalmonitor [14:11:47] (mwlog1001) [14:12:20] subbu: logs are a mess, but that is usual :) [14:12:29] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3812637 (10Gehel) maintenance has been done, icinga check is green again. We can close this. [14:12:37] I don't see anything strange or new, but I might be missing it [14:13:04] subbu: also, logstash links here might be useful https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#SSH_Connections_and_Error_Logs [14:13:06] ya .. lgtm. just one min. [14:13:08] ok. [14:13:34] (03CR) 10Alexandros Kosiaris: [C: 031] Add Prometheus exporter for Etherpad [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [14:14:40] zeljkof, ok, lgtm. let us do it. [14:14:48] subbu: ok, deploying... [14:15:11] Elitre, fyi .. going live on all hosts for itwiki and dewiki now .. verified behavior as expected on mwdebug1002 [14:15:28] aaahhh [14:15:35] (thanks) [14:15:36] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:395009|Enable RemexHTML on itwiki and dewiki (T181188 T181190)]] (duration: 00m 43s) [14:15:48] (03PS1) 10Muehlenhoff: Add Prometheus exporter to profile::etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395538 (https://phabricator.wikimedia.org/T182095) [14:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:51] T181188: Enable RemexHTML on dewiki - https://phabricator.wikimedia.org/T181188 [14:15:51] T181190: Enable RemexHTML on itwiki - https://phabricator.wikimedia.org/T181190 [14:15:57] subbu, Elitre: deployed, please check and thanks for deploying with #releng ;) [14:16:05] zeljkof, ty :) [14:16:22] Elitre, one thing to remember is that pages will be cached. [14:16:33] so, only on page edits will folks see remex rendering. 
[14:16:37] or a purge. [14:16:37] Jhs: please stand by, your commit will be at mwdebug in a few minutes [14:16:39] yeah. [14:17:05] Elitre, let us move to the parsoid channel for additional discussion [14:17:06] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [14:17:15] zeljkof, (Y) this is the one I asked you about last week, that needs the script run. Sorry to bother you with scripts two days in a row ;) [14:17:36] this one has fewer options though, so less room for error :) [14:19:13] (03Merged) 10jenkins-bot: Add category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [14:19:23] (03CR) 10jenkins-bot: Add category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [14:19:30] Jhs: no problem, scripts are a usual part of deployment [14:19:48] :) [14:19:49] and looks like we were not to blame for the problem with a couple of pages yesterday :) [14:20:03] that's good [14:20:28] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:19] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.031 second response time [14:21:42] Jhs: the commit is at mwdebug, running the script... 
[14:21:53] (Y) [14:23:00] (03PS1) 10Volans: wmf-auto-reimage: properly get Puppet CA master [puppet] - 10https://gerrit.wikimedia.org/r/395539 (https://phabricator.wikimedia.org/T182096) [14:23:04] the script is done, please check at mwdebug1002 and let me know if I can deploy [14:24:34] Jhs: ^ [14:24:36] https://phabricator.wikimedia.org/T181503#3812718 [14:24:43] (script output) [14:25:44] zeljkof, sorry, does not look right at all :( [14:25:56] Jhs: ouch [14:26:00] https://se.wikipedia.org/wiki/Kategoriija:S%C3%A1megielaid_alfabehta all the special charecters should have their own headings [14:26:30] i mean, the sort order is correct, but the headings are incorrect. Before the headings were correct but the sort order not correct. [14:27:05] i'll have to investigate more how to fix that, can't do that within this window, so i think it's better to abandon [14:28:07] shold I revert the commit? is there a script I should run to get things back as they were? [14:28:13] Jhs: ^ [14:28:13] (03CR) 10Alexandros Kosiaris: [C: 031] Add Prometheus exporter to profile::etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395538 (https://phabricator.wikimedia.org/T182095) (owner: 10Muehlenhoff) [14:29:55] zeljkof, don't know. Just guessing, but if you revert the commit and run the script with --previous-collation=uca-se-u-kn that miiight work? 
[14:30:51] Jhs: let's try it [14:31:25] (03PS1) 10Zfilipin: Revert "Add category collation for sewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395540 [14:32:31] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395540 (owner: 10Zfilipin) [14:32:53] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3812738 (10Ottomata) [14:33:21] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3812740 (10Ottomata) [14:33:52] (03Merged) 10jenkins-bot: Revert "Add category collation for sewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395540 (owner: 10Zfilipin) [14:35:15] Jhs: the revert is at mwdebug, running the script [14:36:54] (03CR) 10jenkins-bot: Revert "Add category collation for sewiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395540 (owner: 10Zfilipin) [14:37:41] Jhs: the script did not do anything :( https://phabricator.wikimedia.org/T181503#3812747 [14:38:08] zeljkof, weird. because now it looks normal again (i.e. the way it did before) [14:38:33] Jhs: the same for me :| strange [14:38:43] should I deploy the revert? 
[14:38:48] sure [14:39:05] you can investigate what went wrong and we can deploy a new commit some other time [14:39:10] oui :) [14:39:15] ok, deploying [14:40:14] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:395540|Revert "Add category collation for sewiki" (T181503)]] (duration: 00m 44s) [14:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:25] T181503: Add proper category collation for the Northern Sami Wikipedia - https://phabricator.wikimedia.org/T181503 [14:40:43] Jhs: deployed, please check if something is broken and thanks for deploying with #releng ;) [14:40:54] (03CR) 10Ottomata: [C: 031] "Couple nits, but+1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:41:21] zeljkof, thanks, it looks ok now :) [14:41:42] !log EU SWAT finished [14:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:27] (03PS5) 10Ottomata: Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) [14:42:28] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.092 second response time [14:42:32] (03CR) 10Muehlenhoff: Add Prometheus exporter for Etherpad (034 comments) [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [14:42:40] (03PS3) 10Muehlenhoff: Add Prometheus exporter for Etherpad [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 [14:45:09] (03PS2) 10Volans: wmf-auto-reimage: properly get Puppet CA master [puppet] - 10https://gerrit.wikimedia.org/r/395539 (https://phabricator.wikimedia.org/T182096) [14:47:06] (03CR) 10Giuseppe Lavagetto: [C: 031] wmf-auto-reimage: properly get Puppet CA master [puppet] - 10https://gerrit.wikimedia.org/r/395539 
(https://phabricator.wikimedia.org/T182096) (owner: 10Volans) [14:48:09] (03PS2) 10Marostegui: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395535 [14:48:34] (03CR) 10Volans: [C: 032] wmf-auto-reimage: properly get Puppet CA master [puppet] - 10https://gerrit.wikimedia.org/r/395539 (https://phabricator.wikimedia.org/T182096) (owner: 10Volans) [14:52:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395535 (owner: 10Marostegui) [14:52:43] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3810367 (10Joe) A few things to note: - htmlCacheUpdate job frequency varies a **lot** between wikis. Even a moderately large wiki like `dewiki` can have rela... [14:53:32] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395535 (owner: 10Marostegui) [14:53:46] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395535 (owner: 10Marostegui) [14:54:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1076 (duration: 00m 43s) [14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:32] (03PS6) 10Ottomata: Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) [14:56:06] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3812821 (10StevenJ81) Well, that's why I suggested creating this "wikimorgue" (or whatever) name. This gives the phantom wikis nam... 
[14:56:26] (03PS1) 10Giuseppe Lavagetto: environments: add environment for removing hiera autolookups [puppet] - 10https://gerrit.wikimedia.org/r/395545 (https://phabricator.wikimedia.org/T181971) [14:56:28] (03PS1) 10Giuseppe Lavagetto: standard: assume standard profile structure [puppet] - 10https://gerrit.wikimedia.org/r/395546 (https://phabricator.wikimedia.org/T181971) [14:59:03] (03CR) 10Eevans: [C: 031] "Ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/395086 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [14:59:12] (03PS2) 10Eevans: hieradata: enable restbase1014-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/395086 (https://phabricator.wikimedia.org/T179422) [14:59:39] (03PS1) 10Faidon Liambotis: geoip: add GeoIP2-ISP to the list [puppet] - 10https://gerrit.wikimedia.org/r/395549 [15:00:47] (03CR) 10Dzahn: [C: 032] hieradata: enable restbase1014-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/395086 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [15:01:18] (03PS4) 10Elukey: role::analytics_cluster::hive/oozie: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) [15:01:20] mutante: ty! 
[15:03:49] !log bootstrapping cassandra, restbase1014-c - T179422 [15:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:00] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:04:34] (03PS1) 10Ottomata: Dummy files for PCC kafka-jumbo [labs/private] - 10https://gerrit.wikimedia.org/r/395551 [15:05:04] (03PS2) 10Ottomata: Dummy files for PCC kafka-jumbo [labs/private] - 10https://gerrit.wikimedia.org/r/395551 [15:05:23] (03CR) 10Ottomata: [V: 032 C: 032] Dummy files for PCC kafka-jumbo [labs/private] - 10https://gerrit.wikimedia.org/r/395551 (owner: 10Ottomata) [15:06:21] (03PS1) 10Phantom42: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) [15:08:31] (03PS2) 10Giuseppe Lavagetto: standard: assume standard profile structure [puppet] - 10https://gerrit.wikimedia.org/r/395546 (https://phabricator.wikimedia.org/T181971) [15:08:43] (03PS7) 10Ottomata: Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) [15:08:48] (03CR) 10Ottomata: [V: 032 C: 032] Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [15:08:53] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9176/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [15:10:58] (03PS6) 10Alexandros Kosiaris: Update to 1.7.10 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [15:15:25] !log restarting kafka-jumbo brokers, applying SSL (downtime scheduled) [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:21] (03PS7) 10Alexandros Kosiaris: Update to 1.7.10 [debs/kubernetes] - 
10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [15:21:05] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/9175/" [puppet] - 10https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [15:32:29] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused [15:37:58] (03PS8) 10Alexandros Kosiaris: Update to 1.7.10 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [15:45:00] (03PS9) 10Alexandros Kosiaris: Update to 1.7.10 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) [15:51:19] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3813182 (10Cmjohnson) [15:51:38] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (10Cmjohnson) Sent an email to @Legal about signed NDA [15:52:02] 10Operations, 10Scoring-platform-team, 10Wikimedia-Logstash, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3813187 (10awight) Celery is now logging verbosely to /srv/log/ores/app.log, please wire that into logstash. 
[15:53:54] !log awight@tin Started deploy [ores/deploy@6baed71]: (non-production) Test ORES deployment to ores100[1-2] [15:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:55] !log awight@tin Finished deploy [ores/deploy@6baed71]: (non-production) Test ORES deployment to ores100[1-2] (duration: 01m 01s) [15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:29] !log beginning stress test on ores* (non-production) [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:42] 10Operations, 10Wikimedia-Logstash, 10hardware-requests, 10Discovery-Search (Current work), 10Patch-For-Review: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3813321 (10Gehel) [16:14:31] 10Operations, 10Wikimedia-Logstash, 10hardware-requests, 10Discovery-Search (Current work), 10Patch-For-Review: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3813347 (10Gehel) [16:14:37] (03PS1) 10Filippo Giunchedi: profile: add redis_exporter to ores::redis [puppet] - 10https://gerrit.wikimedia.org/r/395563 (https://phabricator.wikimedia.org/T148637) [16:17:41] (03CR) 10Alexandros Kosiaris: [C: 031] profile: add redis_exporter to ores::redis [puppet] - 10https://gerrit.wikimedia.org/r/395563 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [16:18:47] (03CR) 10Filippo Giunchedi: [C: 031] "Yup, LGTM!" [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [16:20:04] (03CR) 10Filippo Giunchedi: "Thanks Alex!" 
[puppet] - 10https://gerrit.wikimedia.org/r/395563 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [16:21:16] (03PS1) 10Gehel: logstash: move eventlogging collection to logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/395564 (https://phabricator.wikimedia.org/T175830) [16:21:18] (03PS1) 10Gehel: logstash: decommission logstash100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/395565 (https://phabricator.wikimedia.org/T175830) [16:21:20] (03PS2) 10Filippo Giunchedi: profile: add redis_exporter to ores::redis [puppet] - 10https://gerrit.wikimedia.org/r/395563 (https://phabricator.wikimedia.org/T148637) [16:22:06] (03CR) 10Filippo Giunchedi: [C: 032] profile: add redis_exporter to ores::redis [puppet] - 10https://gerrit.wikimedia.org/r/395563 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [16:23:41] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3813376 (10awight) [16:38:01] (03PS1) 10Ottomata: Use super.users instead of kafka-acls exec to authenticate broker principals [puppet] - 10https://gerrit.wikimedia.org/r/395568 (https://phabricator.wikimedia.org/T167304) [16:39:37] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3813423 (10Cmjohnson) a:03Cmjohnson [16:39:57] (03CR) 10Elukey: [C: 031] Use super.users instead of kafka-acls exec to authenticate broker principals [puppet] - 10https://gerrit.wikimedia.org/r/395568 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [16:40:33] (03CR) 10Ottomata: [C: 032] Use super.users instead of kafka-acls exec to authenticate broker principals [puppet] - 10https://gerrit.wikimedia.org/r/395568 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [16:41:08] (03PS1) 10Filippo Giunchedi: prometheus: add ores 
redis job [puppet] - 10https://gerrit.wikimedia.org/r/395569 (https://phabricator.wikimedia.org/T148637) [16:48:31] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3813464 (10awight) Running a low-ish test at 1,200 req/min, https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=151249012992... [16:51:08] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3813467 (10Cmjohnson) I disconnected the disk shelves and powered down. @faidon please let me know when and if it's okay to coordinate the drop off. [16:57:16] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3813527 (10awight) I got this shred of stack trace from `service celery-ores-worker status -l`, but can't get at anything more with m... [17:00:04] godog, moritzm, and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171205T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:32] yep no patches [17:00:50] * godog scavenging for a gif [17:01:28] aahaah [17:01:48] https://i.imgur.com/27BH2V1.mp4 [17:02:52] urandom: mutante has kindly offered to help with the firmware upgrade tomorrow [17:03:17] ^ yup, we can do that tomorrow [17:04:33] (03PS2) 10Cmjohnson: mend [puppet] - 10https://gerrit.wikimedia.org/r/393774 [17:06:03] (03CR) 10Cmjohnson: [C: 032] mend [puppet] - 10https://gerrit.wikimedia.org/r/393774 (owner: 10Cmjohnson) [17:06:33] cmjohnson1: o/ - just to quickly check with you, the new mw* servers are ok to reimage etc.. right? 
[17:06:53] elukey: yes they're all yours [17:07:01] super thanks [17:07:03] just had that small mishap with dns last week [17:07:23] yep yep, just wanted to figure out if any action was still pending [17:07:31] (03PS2) 10Gehel: wdqs: use the /readiness-probe in WDQS icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/394987 (https://phabricator.wikimedia.org/T181989) [17:08:06] madhuvishy: o/ - if you have time today would you mind to check gerrit or phab for the kafka1018 issue? [17:08:15] (03CR) 10Gehel: [C: 032] wdqs: use the /readiness-probe in WDQS icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/394987 (https://phabricator.wikimedia.org/T181989) (owner: 10Gehel) [17:09:25] mutante: that's awesome; thank you! [17:10:00] (03Abandoned) 10Cmjohnson: Adding production dns for db1111/2 [dns] - 10https://gerrit.wikimedia.org/r/393781 (owner: 10Cmjohnson) [17:10:39] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3128: Connection refused [17:10:42] (03PS2) 10Faidon Liambotis: geoip: add GeoIP2-ISP to the list [puppet] - 10https://gerrit.wikimedia.org/r/395549 [17:10:59] (03CR) 10Faidon Liambotis: [C: 032] geoip: add GeoIP2-ISP to the list [puppet] - 10https://gerrit.wikimedia.org/r/395549 (owner: 10Faidon Liambotis) [17:11:40] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4031 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.157 second response time [17:14:10] 10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3813603 (10Dzahn) 05Open>03stalled [17:15:10] mutante: would you have the availability to do it today, instead? 
[17:15:52] mutante: anytime after the bootstrap of 1014-c finishes (2-3 hours, I guess) [17:19:42] 10Operations, 10Domains, 10Traffic: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3813612 (10Dzahn) This is blocked technically on T133548 (and partially on the other ones linked in my last comment). [17:20:36] urandom: ok, let's do that today then [17:20:59] well, let me move to co-working space. but in 2-3 hours is good [17:21:04] mutante: great, i'll let you know when the bootstrap is done [17:21:18] after which, whenever is convenient for you [17:21:44] (03PS1) 10Cmjohnson: Adding mgmt and production dns entries for db111[1-4] [dns] - 10https://gerrit.wikimedia.org/r/395574 [17:22:10] 'kk [17:25:09] PROBLEM - mediawiki-installation DSH group on mw2118 is CRITICAL: Host mw2118 is not in mediawiki-installation dsh group [17:26:34] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add Prometheus exporter for Etherpad [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/395532 (owner: 10Muehlenhoff) [17:27:11] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2118.codfw.wmnet [17:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:14] elukey: hey I looked at it briefly, what would you need from me? 
[17:29:59] (03CR) 10Filippo Giunchedi: Add a Prometheus exporter for PDNS recursor (031 comment) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [17:30:46] (03PS2) 10Dzahn: contint: a slave script will require 'jq' [puppet] - 10https://gerrit.wikimedia.org/r/395198 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [17:31:19] (03CR) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor (031 comment) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [17:31:30] (03CR) 10Dzahn: [C: 032] contint: a slave script will require 'jq' [puppet] - 10https://gerrit.wikimedia.org/r/395198 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [17:32:17] (03PS2) 10Muehlenhoff: Add Prometheus exporter to profile::etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395538 (https://phabricator.wikimedia.org/T182095) [17:33:32] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus exporter to profile::etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395538 (https://phabricator.wikimedia.org/T182095) (owner: 10Muehlenhoff) [17:33:56] (03CR) 10Dzahn: "yep, no issue on contint1001 this time" [puppet] - 10https://gerrit.wikimedia.org/r/395198 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [17:36:58] (03CR) 10Dzahn: Fix killing dumpers in Wikidata entity dumpers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [17:42:18] (03PS1) 10Muehlenhoff: Add Prometheus scraper config for Etherpad [puppet] - 10https://gerrit.wikimedia.org/r/395577 (https://phabricator.wikimedia.org/T182095) [17:42:21] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3813675 (10awight) I'm pretty sure it's just an OOM, still it would be nice to be able to read more logs. The available memory graph... 
[17:42:33] (03PS1) 10Ema: mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) [17:42:46] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3813690 (10mobrovac) Set a deployment window for the migration for [2017-12-06 17:30 UTC](https://wikitech.wikimedia.org/wiki/Deployments#Week_of_December_4th). [17:45:12] (03PS1) 10Awight: Reduce the number of Celery workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/395579 (https://phabricator.wikimedia.org/T169246) [17:45:39] akosiaris: ^ Have a minute to knock that out? [17:46:07] I think we’re getting closer to an answer with the SPT on our new cluster. [17:46:39] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3813700 (10Cmjohnson) @greg Can you review/approve please [17:47:01] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3813702 (10Cmjohnson) a:03Cmjohnson [17:47:48] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3813703 (10Halfak) This is probably not a memory leak. Workers increase in memory because they variable amounts of data for performi... [17:53:45] (03CR) 10Dzahn: "do they all run as the same user? does that user run other stuff? thinking of "killall -u username" for this" [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [18:00:05] cscott, arlolra, subbu, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171205T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:23] no parsoid deploy today [18:05:00] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Requesting access to deploy-service for pnorman - https://phabricator.wikimedia.org/T182066#3813772 (10greg) Approve. [18:10:09] 10Operations, 10ops-eqiad: fix hostname for fmsw-eqiad to fmsw-c1-eqiad - https://phabricator.wikimedia.org/T180821#3813780 (10Cmjohnson) 05Open>03Resolved [18:17:26] (03PS1) 10Ladsgroup: Enable description usage tracking for all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) [18:19:19] (03CR) 10jerkins-bot: [V: 04-1] Enable description usage tracking for all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) (owner: 10Ladsgroup) [18:21:01] (03PS2) 10Ladsgroup: Enable description usage tracking for all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) [18:24:19] !log ppchelko@tin Started deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. Limited to kafka1001. T180017 [18:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:29] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [18:24:33] !log ppchelko@tin Finished deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. Limited to kafka1001. 
T180017 (duration: 00m 14s) [18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:09] RECOVERY - mediawiki-installation DSH group on mw2118 is OK: OK [18:40:29] (03CR) 10Dzahn: [C: 031] "lgtm, checked these IPs are all "NXDOMAIN" currently and don't appear to be used" [dns] - 10https://gerrit.wikimedia.org/r/395574 (owner: 10Cmjohnson) [18:56:24] (03PS1) 10Ottomata: Grant Create (topic) and Describe on kafka cluster resource for User:Anonymous [puppet] - 10https://gerrit.wikimedia.org/r/395586 (https://phabricator.wikimedia.org/T167304) [18:58:06] (03PS2) 10Ottomata: Grant Create (topic) and Describe on kafka cluster resource for User:Anonymous [puppet] - 10https://gerrit.wikimedia.org/r/395586 (https://phabricator.wikimedia.org/T167304) [19:00:02] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9180/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/395586 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [19:00:04] (03CR) 10Ottomata: [C: 032] Grant Create (topic) and Describe on kafka cluster resource for User:Anonymous [puppet] - 10https://gerrit.wikimedia.org/r/395586 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [19:01:24] (03PS12) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [19:02:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [19:03:09] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:03:41] (03PS13) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [19:05:10] (03CR) 10Gehel: "eventlogging collection is queued in kafka, so merging the change and applying it first on logstash1003 and then on logstash1007 should en" [puppet] - 10https://gerrit.wikimedia.org/r/395564 (https://phabricator.wikimedia.org/T175830) (owner: 10Gehel) [19:06:17] (03CR) 10EBernhardson: [C: 031] "seems reasonable enough." [puppet] - 10https://gerrit.wikimedia.org/r/395564 (https://phabricator.wikimedia.org/T175830) (owner: 10Gehel) [19:09:29] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1956 bytes in 0.246 second response time [19:17:06] (03PS2) 10Gehel: logstash: move eventlogging collection to logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/395564 (https://phabricator.wikimedia.org/T175830) [19:18:35] (03CR) 10Gehel: [C: 032] logstash: move eventlogging collection to logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/395564 (https://phabricator.wikimedia.org/T175830) (owner: 10Gehel) [19:20:18] !log moving eventlogging collection by logstash from logstash1003 to logstash1007, no messages **should** be lost - T175830 [19:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:27] T175830: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830 [19:28:09] RECOVERY - puppet last run on db1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:29:47] (03PS1) 10Ottomata: Enable async for eventbus kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/395593 (https://phabricator.wikimedia.org/T180017) [19:30:27] (03CR) 10Ppchelko: [C: 031] Enable async for eventbus kafka producer [puppet] - 
10https://gerrit.wikimedia.org/r/395593 (https://phabricator.wikimedia.org/T180017) (owner: 10Ottomata) [19:31:27] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3814037 (10mmodell) [19:32:11] (03CR) 10Ottomata: [C: 032] Enable async for eventbus kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/395593 (https://phabricator.wikimedia.org/T180017) (owner: 10Ottomata) [19:32:39] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:15] !log ppchelko@tin Started deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. T180017 [19:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:26] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [19:37:45] !log ppchelko@tin Finished deploy [eventlogging/eventbus@6ca0372]: Make the kafka async deliver callback thread-safe. 
T180017 (duration: 01m 29s) [19:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:36] (03CR) 10Dzahn: [C: 032] Reduce the number of Celery workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/395579 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) [19:41:47] (03PS2) 10Dzahn: Reduce the number of Celery workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/395579 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) [19:41:56] lol mutante <3 [19:42:06] hehe [19:45:08] (03PS8) 10Gehel: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) [19:45:15] (03PS2) 10ArielGlenn: remove extraneous cleanups for misc dumps from scripts and crons [puppet] - 10https://gerrit.wikimedia.org/r/395495 [19:46:22] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3814089 (10mmodell) @fgiunchedi I don't have a very clear picture of disk usage given that it will be growing but I would expect that request volume will be pretty low. This i... 
[19:46:26] (03CR) 10ArielGlenn: [C: 032] remove extraneous cleanups for misc dumps from scripts and crons [puppet] - 10https://gerrit.wikimedia.org/r/395495 (owner: 10ArielGlenn) [19:46:57] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3814091 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [19:47:03] (03CR) 10Gehel: [C: 032] wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [19:47:25] (03PS9) 10Gehel: wdqs: schedule cronjob to reload categories [puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) [19:47:50] mutante: If you feel like kicking that out, it affects ores100*.eqiad.wmnet [19:48:03] donno how automated that puppet deployment is... [19:48:37] awight: ah, should i run puppet on all ores1* ? i can do that. or it runs at random times in the next half hour [19:48:40] hold on [19:49:28] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3814099 (10Marostegui) Thanks - it is rebuilding correctly: ``` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I Port... [19:50:42] !log forcing puppet run on ores eqiad to reduce number of celery workers used for stress test [19:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:02] awight: it should be "deployed" now [19:54:16] wicked! [20:00:04] no_justification: (Dis)respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171205T2000). Please do the needful. [20:00:07] No GERRIT patches in the queue for this window AFAICS. 
[20:03:14] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3814132 (10greg) [20:05:39] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 4.853 second response time [20:06:42] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3814151 (10greg) [20:06:56] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3814156 (10Halfak) [20:08:32] marostegui: would you have some time for https://phabricator.wikimedia.org/T182143 ? Well not right now, but tomorrow or the day after (Pinging you since you helped me with the task mentioned) [20:08:49] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:49] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 9.011 second response time [20:14:49] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:29] RECOVERY - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.137 port 9042 [20:18:31] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3814202 (10awight) Downsizing my estimate. With 100 workers/box, we're hovering around 28GB free, so (57.5GB - 28GB) / 100 workers =... 
[20:20:48] (03PS1) 10Chad: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395607 [20:21:18] !log demon@tin Started scap: bootstrap wmf.11 [20:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:39] (03PS1) 10Awight: Tune ORES Celery workers up a bit [puppet] - 10https://gerrit.wikimedia.org/r/395608 (https://phabricator.wikimedia.org/T169246) [20:23:17] (03PS2) 10Awight: Tune ORES Celery workers up a bit [puppet] - 10https://gerrit.wikimedia.org/r/395608 (https://phabricator.wikimedia.org/T169246) [20:24:09] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 362.67 seconds [20:25:03] 10Operations, 10RESTBase-Cassandra, 10Services (later): cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#3814209 (10Eevans) 05Open>03Resolved The limitation here is that there is no per-session concurrency for streaming in Cassandra. The throughput observed is t... [20:26:10] is there phabricator maintenance going on? [20:26:28] (03PS2) 10Aklapper: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [20:26:54] jynus: don't think so? [20:27:23] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=misc&var-shard=m3&var-role=slave [20:27:29] lots of writes going on [20:27:45] twentyafterfour: ^ [20:28:11] jynus: debugging search [20:28:19] I'm running a search index tuning script [20:28:30] is it too much? 
I can throttle it back some [20:28:31] perfect, so "maintenance going on" [20:28:37] it is ok as long as it is logged [20:28:40] maybe I missed it [20:28:55] jynus: no sorry I didn't log, didn't realize it would be causing a lot of writes [20:29:13] please do now, that makes me "not worried" :-) [20:29:27] !log phabricator: running `sudo bin/search ngrams --threshold 0.2` [20:29:29] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1949 bytes in 0.103 second response time [20:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:00] thanks [20:30:01] mutante: we are a Go on that firmware [20:31:45] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3814221 (10Cmjohnson) @Groovier, Legal does not have a NDA for you on file please reach out to @RStallman-legalteam to review and sign. 
Thanks [20:36:44] (03PS1) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [20:37:11] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.37 seconds Jcrespo ongoing phabricator maintenance, passive (cold) slave with crossdc overhead [20:37:13] (03CR) 10jerkins-bot: [V: 04-1] move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [20:38:39] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.007 second response time [20:41:29] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3814243 (10RStallman-legalteam) @Groovier, to customize the NDA for you, we'll need a physical address and an email address. I will then route... [20:43:06] (03PS2) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [20:44:36] (03CR) 10Halfak: [C: 031] "Any single worker should be able to clear the queue in 10 seconds." 
[puppet] - 10https://gerrit.wikimedia.org/r/395608 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) [20:44:59] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Current state and next steps for RESTBase storage - https://phabricator.wikimedia.org/T152724#3814250 (10Eevans) [20:45:01] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#3814246 (10Eevans) 05Open>03declined I feel this is pretty much a non-starter for RESTBase use-cases. The write amplificati... [20:46:29] (03CR) 10Alexandros Kosiaris: [C: 032] Tune ORES Celery workers up a bit [puppet] - 10https://gerrit.wikimedia.org/r/395608 (https://phabricator.wikimedia.org/T169246) (owner: 10Awight) [20:48:55] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3814268 (10Pchelolo) > I think it's ok to use a small wiki to test the functionality without touching the concurrency configs, though. Which wiktionary did you... [20:50:11] !log phabricator: running `sudo bin/garbage collect --collector search.ferret.ngram` [20:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:51] 10Operations, 10Cassandra, 10RESTBase, 10Services (later): Evaluate ScyllaDB as a near-term replacement to Cassandra - https://phabricator.wikimedia.org/T150811#3814283 (10Eevans) 05Open>03Resolved Just from an operational perspective, this would be a large undertaking, with questionable benefit (at so... 
[20:55:10] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:55:49] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:56:37] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3814308 (10akosiaris) >>! In T169246#3813527, @awight wrote: > I got this shred of stack trace from `service celery-ores-worker statu... [20:57:40] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 0.653 second response time [21:03:49] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:14] (03PS1) 10Ppchelko: Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) [21:04:49] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 8.521 second response time [21:05:33] (03CR) 10jerkins-bot: [V: 04-1] Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [21:07:11] (03PS2) 10Ppchelko: Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) [21:10:49] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:20] !log demon@tin Finished scap: bootstrap wmf.11 (duration: 51m 01s) [21:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:42] madhuvishy: (just get the msg after dinner sorry) - I wanted to know if there is anything to backup or more things to do to shutdown the jupyter stuff on notebook1002.. Kinda your green light to proceed, that's it. 
Don't want to mess with something that might be ongoing! :) [21:23:34] if it is just a matter of reimage etc.. then I'll proceed tomorrow after sending an email to analytics@ [21:24:04] (03CR) 10Chad: [C: 032] group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395607 (owner: 10Chad) [21:25:28] (03Merged) 10jenkins-bot: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395607 (owner: 10Chad) [21:26:29] no_justification: ooooooohhhh, first train with no wikidata build :D [21:26:41] * no_justification does a dance [21:26:45] haha [21:26:51] * addshore passes out [21:27:15] (03CR) 10jenkins-bot: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395607 (owner: 10Chad) [21:30:45] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3814381 (10Halfak) Looks like that was before worker count was changed, right? The last stress test started at 20:08. 
[21:31:31] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 shenanigans / wmf.11 [21:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:19] (03PS3) 10Paladox: puppetmaster: Use ruby-mysql2 over ruby-mysql [puppet] - 10https://gerrit.wikimedia.org/r/391336 [21:39:48] (03PS4) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [21:40:48] (03PS5) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [21:44:49] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [21:52:08] !log restbase1010 - upgrading firmware - Flashing Smart Array P440ar in Slot 0 [ 3.56 -> 6.06 ] [21:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:22] (03PS6) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [21:55:07] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3814414 (10Halfak) It seems that the fastest we can send requests from one machine is about 4.5k/min or 75/sec. I tried running two... 
[21:55:11] (03PS7) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [21:56:23] !log draining cassandra instances, restbase1010 - T178177 [21:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:33] T178177: Investigate aberrant Cassandra columnfamily read latency of restbase1010 - https://phabricator.wikimedia.org/T178177 [21:58:21] !log restbase1010 - upgraded HP firmware (Flashing Smart Array P440ar in Slot 0 [ 3.56 -> 6.06 ]) T141756 T178177 [21:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:31] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [22:00:05] !log restbase1010 - rebooting for firmware upgrade [22:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:29] RECOVERY - HP RAID on db2044 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [22:02:35] (03CR) 10Paladox: "This is what it looks like" [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [22:13:15] !log restbase1010 failed at reboot with P6431 , after a cold start (power off, power on) it came back though :) (T178177 T141756) [22:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:26] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [22:13:28] T178177: Investigate aberrant Cassandra columnfamily read latency of restbase1010 - https://phabricator.wikimedia.org/T178177 [22:16:59] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:45] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3814494 
(10Halfak) OK nevermind. It seems like we have another limit. I ran a stress test on ores1001 and ores1002. Here's what I... [22:34:11] (03PS1) 10Dzahn: puppetmasters codfw: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395655 (https://phabricator.wikimedia.org/T177225) [22:35:51] pdfrender on scb1001 is active (running) despite that icinga line [22:36:21] it keeps happening though [22:41:52] (03PS2) 10Dzahn: puppetmasters codfw: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395655 (https://phabricator.wikimedia.org/T177225) [22:42:20] (03CR) 10Dzahn: [C: 032] "in this case, not by role and all at once" [puppet] - 10https://gerrit.wikimedia.org/r/395655 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:44:12] (03PS8) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [22:49:49] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [22:54:10] (03PS9) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [22:55:34] (03PS10) 10Paladox: planet: Add a xhtml archive plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 [22:59:22] (03PS1) 10Volans: wmf-auto-reimage: add --conftool-value option [puppet] - 10https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798) [22:59:24] (03PS1) 10Volans: wmf-auto-reimage: improve screen/tmux detection [puppet] - 10https://gerrit.wikimedia.org/r/395663 (https://phabricator.wikimedia.org/T181796) [23:13:59] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:59] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 5.082 second response time [23:18:00] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:00] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 6.888 
second response time [23:23:04] (03PS11) 10Dzahn: planet: Add RSS plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [23:28:09] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:57] (03PS12) 10Dzahn: planet: Add RSS plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [23:29:27] (03CR) 10jerkins-bot: [V: 04-1] planet: Add RSS plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [23:31:18] (03PS1) 10Dzahn: puppetmasters eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395668 (https://phabricator.wikimedia.org/T177225) [23:32:42] (03PS13) 10Dzahn: planet: Add RSS plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [23:34:37] (03PS2) 10Dzahn: puppetmasters eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395668 (https://phabricator.wikimedia.org/T177225) [23:35:34] (03CR) 10Dzahn: [C: 032] puppetmasters eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395668 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [23:37:36] (03CR) 10Dzahn: [C: 032] planet: Add RSS plugin to rawdog [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [23:42:04] (03Draft1) 10Paladox: planet: Add stretch check in rawdogplugin.pp [puppet] - 10https://gerrit.wikimedia.org/r/395671 [23:42:08] (03PS2) 10Paladox: planet: Add stretch check in rawdogplugin.pp [puppet] - 10https://gerrit.wikimedia.org/r/395671 [23:42:49] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. 
Failed resources (up to 3 shown): File[/etc/rawdog/uk/plugins],File[/etc/rawdog/pl/plugins],File[/etc/rawdog/fr/plugins],File[/etc/rawdog/it/plugins] [23:42:55] (03PS3) 10Paladox: planet: Add stretch check in rawdogplugin.pp [puppet] - 10https://gerrit.wikimedia.org/r/395671 [23:43:09] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 6.609 second response time [23:44:37] (03CR) 10Dzahn: "works on stretch, follow-up for jessie, @Legoktm and the links where the plugin has been obtained are added there as well https://gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/392657 (owner: 10Paladox) [23:45:06] your welcome :) [23:46:53] paladox: let's put the "if stretch" right where planet::rawdogplugin is used.. instead of inside the class [23:47:00] ok [23:47:55] (03PS4) 10Paladox: planet: Add stretch check in init.pp around rawdogplugin [puppet] - 10https://gerrit.wikimedia.org/r/395671 [23:48:09] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:55] (03CR) 10Dzahn: [C: 032] planet: Add stretch check in init.pp around rawdogplugin [puppet] - 10https://gerrit.wikimedia.org/r/395671 (owner: 10Paladox) [23:50:31] paladox: 👍 it fixed it! thanks [23:50:32] (03PS1) 10Ladsgroup: Make wikidata cronjobs use the Wikibase extension and not the build [puppet] - 10https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) [23:50:39] your welcome :) [23:50:56] mutante: hey, do you have a minute to check this? https://gerrit.wikimedia.org/r/395672 [23:51:12] It will make lots of problems if we don't merge it soon [23:51:20] (03CR) 10Legoktm: [C: 031] Make wikidata cronjobs use the Wikibase extension and not the build [puppet] - 10https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) (owner: 10Ladsgroup) [23:52:06] Amir1: looking.. 
yes, i saw in chat earlier it was the first deploy without wikidata build [23:52:41] Thanks [23:52:49] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:53:25] checking it on maintenance server [23:53:44] planet1001? Every time I think I know all clusters, I'm proven to be wrong [23:54:19] Amir1: https://wikitech.wikimedia.org/wiki/Planet [23:54:42] (mentions Bugzilla, oops :) [23:55:03] see this https://meta.wikimedia.org/wiki/Planet_Wikimedia [23:55:24] Thanks, it seems super old [23:55:43] that's why we are working on replacing it with a package that is maintained and in stretch [23:55:50] that's what those changes were for [23:55:58] yes, it exists since forever [23:56:07] this would be the 3rd software providing the feeeds [23:57:00] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 0.559 second response time [23:57:49] (03CR) 10Dzahn: [C: 032] "confirmed on terbium, and today was the first deploy without wikidata build" [puppet] - 10https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) (owner: 10Ladsgroup) [23:57:56] Amir1: newer: http://planet-hotdog.wmflabs.org/ [23:58:11] (03PS2) 10Dzahn: Make wikidata cronjobs use the Wikibase extension and not the build [puppet] - 10https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) (owner: 10Ladsgroup) [23:58:17] back to the maintenance jobs, doing that [23:58:24] clicked on one of them and got "Resource Limit Is Reached" [23:58:48] Probably someone in WMDE office clicked on all of the links :D [23:58:58] i see, it's pure coincidence [23:59:04] the error is on a remote server [23:59:28] it's http://infodisiac.com/blog/2017/12/wiki-loves-monuments-2017/ that is broken [23:59:41] wlm [23:59:53] and it happened to be the top post