[00:27:39] RECOVERY - Memory correctable errors -EDAC- on elastic1029 is OK: (C)4 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops
[03:25:38] (CR) Mathew.onipe: wdqs: add data-reload cookbook (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[03:32:03] (PS5) Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[04:20:16] (PS1) BryanDavis: toolforge: Remove updatetools script [puppet] - https://gerrit.wikimedia.org/r/542777 (https://phabricator.wikimedia.org/T229261)
[04:48:07] Operations, ops-eqiad, DBA, DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (Marostegui) Sounds good thanks - I will have this host ready.
[04:53:26] (PS1) Marostegui: dbproxy1011: Depool labsdb1009 [puppet] - https://gerrit.wikimedia.org/r/542778 (https://phabricator.wikimedia.org/T233273)
[04:55:33] (CR) Marostegui: [C: +2] dbproxy1011: Depool labsdb1009 [puppet] - https://gerrit.wikimedia.org/r/542778 (https://phabricator.wikimedia.org/T233273) (owner: Marostegui)
[04:56:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9318 and previous config saved to /var/cache/conftool/dbconfig/20191014-045629-marostegui.json
[04:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:35] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[04:56:53] !log Depool labsdb1009 for on-site maintenance - T233273
[04:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:57] T233273: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273
[05:04:31] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (Marostegui)
[05:06:25] PROBLEM - Long running screen/tmux on netbox1001 is CRITICAL: CRIT: Long running tmux process. (user: root PID: 19702, 1734913s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[05:19:55] Operations, DBA: db2068 is misbehaving (but is depooled) - https://phabricator.wikimedia.org/T235366 (Marostegui) Thanks for filling this out. This host had a storage crash sometime ago T180927 and it looks like it had another one: Logs from the 13th. ` description=An Unrecoverable System Error (NMI...
[05:27:13] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:28:32] Operations, DBA: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui)
[05:29:29] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:29:32] Operations, DBA: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui)
[05:30:27] (PS1) Marostegui: db2068: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/542782 (https://phabricator.wikimedia.org/T235399)
[05:30:54] Operations, DBA, Patch-For-Review: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui) p:Triage→Normal
[05:32:39] (CR) Marostegui: [C: +2] db2068: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/542782 (https://phabricator.wikimedia.org/T235399) (owner: Marostegui)
[05:40:05] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:47:45] !log Remove db2068 from tendril and zarcillo T235399
[05:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:49] T235399: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399
[05:50:51] gilles: o/ - on the webperf nodes the /srv partition is at ~96% usage, worst offender in /srv/xenon/logs/daily
[05:51:17] it is not super urgent for the moment so no need to drop now, but some clean up sooner rather than later would be good
[05:54:54] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2068 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/542783 (https://phabricator.wikimedia.org/T235399)
[05:55:57] (PS1) Elukey: Move an-conf* dchp configuration in the right linux-host-entries file [puppet] - https://gerrit.wikimedia.org/r/542784 (https://phabricator.wikimedia.org/T227025)
[05:57:00] (CR) Elukey: [C: +2] Move an-conf* dchp configuration in the right linux-host-entries file [puppet] - https://gerrit.wikimedia.org/r/542784 (https://phabricator.wikimedia.org/T227025) (owner: Elukey)
[05:59:39] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2068 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/542783 (https://phabricator.wikimedia.org/T235399) (owner: Marostegui)
[06:00:29] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2068 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/542783 (https://phabricator.wikimedia.org/T235399) (owner: Marostegui)
[06:02:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2068 from config T235399 (duration: 00m 53s)
[06:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:17] T235399: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399
[06:03:11] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2068 from config T235399 (duration: 00m 51s)
[06:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:15] Operations, DBA, Patch-For-Review: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui)
[06:04:29] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:04:52] this is ne
[06:05:03] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[06:06:54] dcausse, gehel o/ - from Icinga I can see an unassigned shard check for ES
[06:10:14] (PS1) Aklapper: Phabricator: Uninstall Conpherence application also in default settings [puppet] - https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640)
[06:12:52] * elukey reads https://wikitech.wikimedia.org/wiki/Search/Trouble#Stuck_in_red
[06:13:16] elukey: eqiad is having this problem intermittently, probably nodes with disk space running out. It usually recovers.
[06:13:27] I'm acking the alert
[06:13:41] onimisionipe: o/ sorry I forgot you in the ping!
[06:13:52] Nodes will be replaced soon
[06:14:02] the alarm started a day ago this is why I asked
[06:14:05] super thanks!
[06:14:43] If this persists, we are going to have to reindex the big wikis after changing how shards spread out. I'm watching this
[06:14:50] Thanks for the ping
[06:15:59] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1546970425[3](2019-10-09T14:42:44.498Z) Mathew.onipe Enwiki seems to be having this issue sometimes once a week. If it persists, we are going to have to reindex. I will keep an eye - The acknowledgement expires at: 2019-10-15 18:12:26. https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:17:24] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Lea_Lacroix_WMDE) @Effeietsanders Yep, that was a similar pattern for me: a name that seems "normal" followed by a 5 digits number. I got several thousands suddenly subscribing to...
[06:24:33] (CR) ArielGlenn: "Can you explain or point me to docs on the cache? Is this memcache or something else backing it? The patch is probably fine but I want to u" [puppet] - https://gerrit.wikimedia.org/r/542278 (owner: Hoo man)
[06:28:16] Operations, ops-eqiad, User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (elukey) Open→Resolved Fixed puppet, then manually `/etc/default/grub` on all hosts and finally `sudo update-grub`. Restored all the serial settings...
[06:45:45] (PS1) Elukey: role::analytics_cluster::zookeeper: enable monitoring [puppet] - https://gerrit.wikimedia.org/r/542789 (https://phabricator.wikimedia.org/T217057)
[06:51:42] (PS2) Elukey: role::analytics_cluster::zookeeper: enable monitoring [puppet] - https://gerrit.wikimedia.org/r/542789 (https://phabricator.wikimedia.org/T217057)
[06:53:58] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/18859/" [puppet] - https://gerrit.wikimedia.org/r/542789 (https://phabricator.wikimedia.org/T217057) (owner: Elukey)
[07:00:18] (PS1) Marostegui: mariadb: Remove db2068 from config [puppet] - https://gerrit.wikimedia.org/r/542790 (https://phabricator.wikimedia.org/T235399)
[07:01:11] (PS1) Marostegui: wmnet: Remove db2068 DNS production entries [dns] - https://gerrit.wikimedia.org/r/542791 (https://phabricator.wikimedia.org/T235399)
[07:01:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[07:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:50] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[07:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:55] Operations, DBA, Patch-For-Review: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2068.codfw.wmnet` - db2068.codfw.wmnet (**FAIL**) - Downtimed host on Icinga - Downt...
[07:05:50] Operations, DBA, Patch-For-Review: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui) >>! In T235399#5571734, @ops-monitoring-bot wrote: > cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2068.codfw.wmnet` > - db2068.codfw.wmnet (*...
[07:06:38] (CR) Marostegui: [C: +2] mariadb: Remove db2068 from config [puppet] - https://gerrit.wikimedia.org/r/542790 (https://phabricator.wikimedia.org/T235399) (owner: Marostegui)
[07:06:57] (CR) Marostegui: [C: +2] wmnet: Remove db2068 DNS production entries [dns] - https://gerrit.wikimedia.org/r/542791 (https://phabricator.wikimedia.org/T235399) (owner: Marostegui)
[07:09:01] Operations, ops-codfw, decommission: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui) a:Marostegui→Papaul
[07:09:23] Operations, ops-codfw, DC-Ops, decommission: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Marostegui) Host ready for #dc-ops steps
[07:14:52] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:16:34] !log Stop MySQL on labsdb1009 for on-site maintenance - T233273
[07:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:38] T233273: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273
[07:17:42] Operations, DBA: db2068 is misbehaving (but is depooled) - https://phabricator.wikimedia.org/T235366 (Marostegui) Open→Resolved a:Marostegui Resolving this as the host has been labelled as broken and sent to DC-Ops for decommissioning T235399
[07:19:14] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:19:38] uh?
[07:20:31] Ah right, cause I deleted db2068 from config
[07:21:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2068 from config - T235399', diff saved to https://phabricator.wikimedia.org/P9319 and previous config saved to /var/cache/conftool/dbconfig/20191014-072100-marostegui.json
[07:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:05] T235399: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399
[07:24:22] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:24:59] marostegui: :)
[07:26:33] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@c25a1c2]: Do not pre-generate /page/metadata - T235173
[07:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:37] T235173: Stop pregenerating /page/metadata - https://phabricator.wikimedia.org/T235173
[07:27:58] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@c25a1c2]: Do not pre-generate /page/metadata - T235173 (duration: 01m 25s)
[07:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:17] !log mobrovac@deploy1001 Started deploy [restbase/deploy@e0d071f]: Remove VE logging and stop using storage for /page/metadata - T234928 T235173
[07:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:22] T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928
[07:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 and db2126 to change sanitarium to replicate from db1074 T231638', diff saved to https://phabricator.wikimedia.org/P9320 and previous config saved to /var/cache/conftool/dbconfig/20191014-073319-marostegui.json
[07:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:24] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638
[07:41:54] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@e0d071f]: Remove VE logging and stop using storage for /page/metadata - T234928 T235173 (duration: 13m 37s)
[07:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:59] T235173: Stop pregenerating /page/metadata - https://phabricator.wikimedia.org/T235173
[07:41:59] T234928: RESTBase sometimes not retaining stashed content? - https://phabricator.wikimedia.org/T234928
[07:42:27] Operations: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (MoritzMuehlenhoff)
[07:45:08] !log mobrovac@deploy1001 Started deploy [restbase/deploy@4d469a1] (dev-cluster): Remove VE logging and stop using storage for /page/metadata
[07:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:06] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@4d469a1] (dev-cluster): Remove VE logging and stop using storage for /page/metadata (duration: 03m 58s)
[07:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:08] (CR) Gehel: [C: -1] wdqs: add data-reload cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[07:50:44] Operations, ops-eqiad: maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (MoritzMuehlenhoff)
[07:54:04] !log Stop db1074 and db2126 in sync to change sanitarium's master for s2 - T231638
[07:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:07] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638
[07:54:16] Operations, ops-eqiad, Discovery-Search (Current work): maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (Gehel)
[07:54:33] ACKNOWLEDGEMENT - IPMI Sensor Status on maps1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Gehel tracked in https://phabricator.wikimedia.org/T235406 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[07:56:00] (PS1) Elukey: profile::analytics::cluster::client: remove old code [puppet] - https://gerrit.wikimedia.org/r/542863
[08:03:16] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/542863 (owner: Elukey)
[08:04:59] (CR) Elukey: [C: +2] profile::analytics::cluster::client: remove old code [puppet] - https://gerrit.wikimedia.org/r/542863 (owner: Elukey)
[08:08:24] Operations, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe)
[08:10:35] Operations, serviceops, Kubernetes: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 (Joe)
[08:10:47] Operations, serviceops, Kubernetes: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 (Joe) p:Triage→High a:Joe
[08:17:44] (PS1) Elukey: Move the Analytics Hadoop cluster to the new Analytics ZK cluster [puppet] - https://gerrit.wikimedia.org/r/542866 (https://phabricator.wikimedia.org/T217057)
[08:21:02] Operations, ops-eqiad, decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (MoritzMuehlenhoff) a:RobH→Cmjohnson
[08:30:25] (CR) Jbond: "LGTM once nuria approves" [puppet] - https://gerrit.wikimedia.org/r/542599 (https://phabricator.wikimedia.org/T234529) (owner: Herron)
[08:30:31] (CR) Jbond: [C: +1] admin: add eyener to researchers, analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/542599 (https://phabricator.wikimedia.org/T234529) (owner: Herron)
[08:32:28] (PS1) Elukey: ferm: remove hadoop_masters from puppet config [puppet] - https://gerrit.wikimedia.org/r/542867 (https://phabricator.wikimedia.org/T217057)
[08:32:31] Hi Amir1 :) before banwiki is created, doesn't https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/541527/ need to be merged? (also cc Urbanecm )
[08:33:40] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/542131 (owner: Muehlenhoff)
[08:34:17] (CR) Jbond: [C: +2] puppet: change $::cluster variable to a hiera default [puppet] - https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805) (owner: Jbond)
[08:34:26] (PS6) Jbond: puppet: change $::cluster variable to a hiera default [puppet] - https://gerrit.wikimedia.org/r/541213 (https://phabricator.wikimedia.org/T234805)
[08:36:01] Hi Jhs, that patch has to be merged during the wiki creation window. Thanks for caring!
[08:37:40] (CR) Alexandros Kosiaris: [C: +1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/542867 (https://phabricator.wikimedia.org/T217057) (owner: Elukey)
[08:38:17] Urbanecm, okies (Y) just wanted to check. The patch I submitted for the Scribunto namespaces still hasn't been merged either
[08:40:00] (CR) Jbond: "fyi https://gerrit.wikimedia.org/r/541213 is merged now" [puppet] - https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: Filippo Giunchedi)
[08:41:38] Operations: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (jbond) I started on a path for this here but i ran into a further problem i couldn't easily fix https://gerrit.wikimedia.org/r/c/cergen/+/541796
[08:41:50] (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/542867 (https://phabricator.wikimedia.org/T217057) (owner: Elukey)
[08:45:51] Jhs: thx, we'll take care of that :)
[08:46:33] !log restbase drop metadata keyspaces from cassandra - T235173
[08:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:38] T235173: Stop pregenerating /page/metadata - https://phabricator.wikimedia.org/T235173
[08:47:45] Operations, Performance-Team, media-storage, serviceops, and 2 others: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (Gilles) Yes: ` gilles@ms-fe1005:/var/log/swift$ cat server.log.1 | grep ConnectionTimeout | wc -l 2048 `
[08:48:27] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) is CRITICAL: Test Get metadata from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[08:48:39] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) is CRITICAL: Test Get metadata from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[08:48:57] known ^
[08:49:47] (PS1) Mobrovac: RESTRouter: Use image v1.1.4 [deployment-charts] - https://gerrit.wikimedia.org/r/542874 (https://phabricator.wikimedia.org/T235173)
[08:50:59] (CR) Mobrovac: [V: +2 C: +2] RESTRouter: Use image v1.1.4 [deployment-charts] - https://gerrit.wikimedia.org/r/542874 (https://phabricator.wikimedia.org/T235173) (owner: Mobrovac)
[08:51:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1074 and db2126 after changing sanitarium to replicate from db1074 T231638', diff saved to https://phabricator.wikimedia.org/P9322 and previous config saved to /var/cache/conftool/dbconfig/20191014-085143-marostegui.json
[08:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:48] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638
[08:52:38] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' .
[08:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:17] Operations, ops-eqiad, DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (Marostegui) Open→Resolved db1125:3312 has been moved under db1074 with the following coordinates (GTID also enabled): ` change master to master_host='db1074.eqiad.wmnet', master_user='repl', ma...
[08:54:55] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[08:55:03] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'production' .
[08:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:45] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[08:56:55] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
[08:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:36] (PS2) Arturo Borrero Gonzalez: toolforge: Remove updatetools script [puppet] - https://gerrit.wikimedia.org/r/542777 (https://phabricator.wikimedia.org/T229261) (owner: BryanDavis)
[09:14:29] !log Deploy schema change on dbstore1003:3317
[09:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:45] (PS5) Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: Giuseppe Lavagetto)
[09:20:07] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: Remove updatetools script [puppet] - https://gerrit.wikimedia.org/r/542777 (https://phabricator.wikimedia.org/T229261) (owner: BryanDavis)
[09:22:32] Operations, ops-eqiad, DBA, DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (Marostegui) @Jclark-ctr you can proceed and change the PSU now. MySQL has been stopped.
[09:24:58] ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui T233273 https://wikitech.wikimedia.org/wiki/HAProxy
[09:34:36] !log Upgraded CI jobs to Quibble 0.0.38
[09:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:41] (PS1) Elukey: kerberos: update keytab location for common Hadoop daemons [puppet] - https://gerrit.wikimedia.org/r/542888
[09:38:18] (CR) jerkins-bot: [V: -1] kerberos: update keytab location for common Hadoop daemons [puppet] - https://gerrit.wikimedia.org/r/542888 (owner: Elukey)
[09:38:26] uff
[09:40:07] (PS2) Elukey: kerberos: update keytab location for common Hadoop daemons [puppet] - https://gerrit.wikimedia.org/r/542888
[09:41:06] (CR) Elukey: [C: +2] kerberos: update keytab location for common Hadoop daemons [puppet] - https://gerrit.wikimedia.org/r/542888 (owner: Elukey)
[09:42:18] (PS6) Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: Giuseppe Lavagetto)
[09:47:58] Operations, DNS, Toolforge, Traffic, cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (aborrero) p:Triage→High a:Andrew I believe the last person doing this kind of moveme...
[09:48:04] (CR) Mvolz: [C: +1] Enable reftabs on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: Mvolz)
[09:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1130 into s5 api, db1100 will be removed later in preparation for tomorrow's failover T234300', diff saved to https://phabricator.wikimedia.org/P9325 and previous config saved to /var/cache/conftool/dbconfig/20191014-094809-marostegui.json
[09:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:15] T234300: Switchover s5 primary database master db1070 -> db1100 - 15th Oct 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234300
[09:48:26] (PS4) Marostegui: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/540762 (https://phabricator.wikimedia.org/T234300)
[09:48:36] (PS3) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/540763 (https://phabricator.wikimedia.org/T234300)
[09:51:01] Operations, Citoid, Core Platform Team Legacy (Watching / External), Services (watching): Support meta tag refresh redirects in citoid to support elsevier's linking hub - https://phabricator.wikimedia.org/T204032 (Mvolz) a:Mvolz→None
[10:05:58] (PS1) Marostegui: db-eqiad.php: Temporary pool pc1010 in pc1 [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142)
[10:06:27] (CR) Marostegui: [C: -2] "Wait until Tuesday 15th" [mediawiki-config] - https://gerrit.wikimedia.org/r/542890 (https://phabricator.wikimedia.org/T227142) (owner: Marostegui)
[10:07:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1100 with weight 0 in preparation for tomorrow's failover T234300', diff saved to https://phabricator.wikimedia.org/P9326 and previous config saved to /var/cache/conftool/dbconfig/20191014-100758-marostegui.json
[10:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:04] T234300: Switchover s5 primary database master db1070 -> db1100 - 15th Oct 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234300
[10:08:37] !now
[10:09:30] (PS1) Zoranzoki21: Fix wrong domain in wgCopyUploadDomains added in T203363 [mediawiki-config] - https://gerrit.wikimedia.org/r/542891 (https://phabricator.wikimedia.org/T235415)
[10:09:54] Operations, Analytics: Add metadata to puppet about kerberos accounts - https://phabricator.wikimedia.org/T235418 (elukey)
[10:20:08] jan_drewniak, hi, I see you're doing the portal update soon. The N'Ko Wikipedia (nqowiki) isn't on Wikipedia.org currently – will that be added automatically, or do we need to file a task for it?
[10:20:41] (or a patch)
[10:28:24] Operations, Traffic, Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (Vgutierrez) After *a lot* of debugging, I've found something that could explain this behaviour. Testing locally with HTTP/1...
[10:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T1030).
[10:30:11] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/542893 (https://phabricator.wikimedia.org/T128546)
[10:31:40] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/542893 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:32:27] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/542893 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:34:45] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:542893| Bumping portals to master (T128546)]] (duration: 00m 52s)
[10:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:49] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:35:37] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:542893| Bumping portals to master (T128546)]] (duration: 00m 51s)
[10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:19] Operations, Traffic, Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (Vgutierrez) Reported to upstream in https://github.com/apache/trafficserver/issues/6018
[10:55:27] (PS1) KartikMistry: Also add templatemapping to cxserver prod config [deployment-charts] - https://gerrit.wikimedia.org/r/542897 (https://phabricator.wikimedia.org/T224721)
[10:57:14] (CR) Santhosh: [C: +1] Also add templatemapping to cxserver prod config [deployment-charts] - https://gerrit.wikimedia.org/r/542897 (https://phabricator.wikimedia.org/T224721) (owner: KartikMistry)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T1100).
[11:00:05] kart_, tarrow, and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:27] OK! I'll go ahead. Need mw-config + cxserver update.
[11:00:54] go ahead
[11:00:57] cool; I'll queue up after :)
[11:01:23] I'll do Zoranzoki's patch
[11:01:30] ping me once I can deploy
[11:01:43] (CR) KartikMistry: [C: +2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/538867 (https://phabricator.wikimedia.org/T232986) (owner: KartikMistry)
[11:02:01] (PS4) KartikMistry: Use ContentTranslationEnableMT to disable MT [mediawiki-config] - https://gerrit.wikimedia.org/r/538867 (https://phabricator.wikimedia.org/T232986)
[11:08:46] (PS2) KartikMistry: Update cxserver to 2019-10-03-054958-production [deployment-charts] - https://gerrit.wikimedia.org/r/540533 (https://phabricator.wikimedia.org/T232986)
[11:09:14] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|538867|Use ContentTranslationEnableMT to disable MT (T232986)]] (duration: 00m 51s)
[11:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:18] T232986: Enable all possible language pairs in cxserver, apply wiki specific configuration in the wiki - https://phabricator.wikimedia.org/T232986
[11:09:24] First patch done.
[11:10:26] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2019-10-03-054958-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/540533 (https://phabricator.wikimedia.org/T232986) (owner: 10KartikMistry) [11:10:38] (03Merged) 10jenkins-bot: Update cxserver to 2019-10-03-054958-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/540533 (https://phabricator.wikimedia.org/T232986) (owner: 10KartikMistry) [11:15:15] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:49] Urbanecm: will take few minutes for cxserver deployment.. [11:15:59] is it okay if I add another config change to the SWAT? [11:16:06] (I can deploy it myself when everything else is done) [11:16:08] kart_: sure [11:16:19] Lucas_WMDE: no objection from me :) [11:17:06] ok :) [11:17:34] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:17:34] Lucas_WMDE or someone else: Could you review&merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/541206, please? [11:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:45] * Lucas_WMDE looks [11:18:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/542867 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [11:20:06] ✔done [11:22:37] thanks [11:22:48] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:51] !log Update cxserver to 2019-10-03-054958-production (T232986) [11:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:57] T232986: Enable all possible language pairs in cxserver, apply wiki specific configuration in the wiki - https://phabricator.wikimedia.org/T232986 [11:27:59] Urbanecm: Done [11:28:13] thanks kart_ , deploying now. [11:28:27] (03PS2) 10Zoranzoki21: Fix wrong domain in wgCopyUploadDomains added in T203363 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542891 (https://phabricator.wikimedia.org/T235415) [11:28:31] Cool; I'll go after Urbanecm; could you ping me when you're done? [11:28:36] sure tarrow [11:28:39] :) [11:28:58] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542891 (https://phabricator.wikimedia.org/T235415) (owner: 10Zoranzoki21) [11:29:31] Hi, I am here for SWAT :) [11:29:50] (03Merged) 10jenkins-bot: Fix wrong domain in wgCopyUploadDomains added in T203363 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542891 (https://phabricator.wikimedia.org/T235415) (owner: 10Zoranzoki21) [11:29:52] Patch no needs testing via x-wikimedia-debug [11:30:05] Ok, it's merged, excellent :) [11:32:49] (03PS1) 10Muehlenhoff: Create a separate component/cergen [puppet] - 10https://gerrit.wikimedia.org/r/542901 (https://phabricator.wikimedia.org/T235405) [11:33:45] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: a295cc7: Fix wrong domain in wgCopyUploadDomains added in T203363 (T235415) (duration: 00m 51s) [11:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:50] T203363: Please add http://www.bollywoodhungama.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T203363 [11:33:51] T235415: Copy uploads not working for https://www.bollywoodhungama.com - 
https://phabricator.wikimedia.org/T235415 [11:33:53] tarrow: the air is clear [11:34:49] cheers! [11:36:13] Lucas_WMDE: I actually need to wait for jenkins to merge; if you have a config patch (or something that beats mine) you could go first? [11:36:21] ok [11:37:09] then I’ll deploy mine [11:37:14] thanks! [11:37:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [11:39:28] (03PS11) 10Lucas Werkmeister (WMDE): Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [11:39:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [11:39:57] Lucas_WMDE: Stupid question from me: I usually watch the logs as deploying from mwlog1001 with `exec fatalmonitor`. Obviously now this doesn't work since it tries to read hhvm.log which is now gone. What is everyone doing now? [11:40:24] (03Merged) 10jenkins-bot: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [11:40:25] tarrow: use https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor [11:40:54] tarrow: T234345 [11:40:54] T234345: Figure out what to do with `fatalmonitor` script - https://phabricator.wikimedia.org/T234345 [11:40:55] if you click "last 15 minutes", then auto-refresh and then 5 seconds, it will be almost-realtime [11:41:03] thanks Lucas_WMDE for linking the task [11:41:39] ah! 
cool [11:42:08] (03PS1) 10Muehlenhoff: Add package builder hook for cergen [puppet] - 10https://gerrit.wikimedia.org/r/542905 [11:42:24] Urbanecm: that’s good to know, please document it somewhere :) [11:42:37] because I was wondering why I hadn’t seen any other SWAT deployer complain about fatalmonitor being broken :D [11:42:44] :D [11:43:00] Lucas_WMDE: any good example of that "somewhere"? [11:43:13] the task would be a good starting point, I think ^^ [11:43:17] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#SSH_Connections_and_Error_Logs ? [11:43:20] and then we can figure out where to pu tit [11:43:22] *put it [11:43:36] (it already has two links for where `fatalmonitor` is currently recommended) [11:43:52] anyways, looking into my config change now [11:44:07] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 5694 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [11:44:10] testing on mwdebug1002 [11:44:12] Lucas_WMDE: done [11:44:31] PROBLEM - Disk space on webperf2002 is CRITICAL: DISK CRITICAL - free space: /srv 5693 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops [11:44:51] tarrow: thanks, added to the outdated note [11:45:00] config change seems to work fine, syncing [11:45:25] PROBLEM - DPKG on ununpentium is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:46:16] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:514461|Enable reftabs on testwikidata (T199197, T228412)]] (duration: 00m 51s) [11:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:21] T199197: [2.11] Integrate Citoid 
in Wikidata - https://phabricator.wikimedia.org/T199197 [11:46:21] T228412: Deploy Citoid Wikibase integration to Test Wikidata - https://phabricator.wikimedia.org/T228412 [11:46:32] tarrow: SWAT is yours [11:46:36] 10Operations, 10Performance-Team, 10serviceops: webperf* running out of disk space - https://phabricator.wikimedia.org/T235425 (10jijiki) [11:46:55] Thanks! I'm still waiting for the butler to finish buttling [11:47:00] (Zuul estimates ten more minutes for the gate-and-submit-wmf :/ ) [11:47:18] ACKNOWLEDGEMENT - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 5688 MB (3% inode=99%): Effie Mouzeli Related Task: T235425 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [11:47:18] ACKNOWLEDGEMENT - Disk space on webperf2002 is CRITICAL: DISK CRITICAL - free space: /srv 5687 MB (3% inode=99%): Effie Mouzeli Related Task: T235425 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops [11:49:11] I should have +2'd earlier I guess but it feels odd doing that before I know I'll be deploying [11:49:31] (03PS14) 10Jbond: puppetmaster/configmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [11:50:02] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [11:51:23] (03CR) 10Hoo man: "@Ariel: This is memcache, yes. See also https://phabricator.wikimedia.org/T180048#3755738." [puppet] - 10https://gerrit.wikimedia.org/r/542278 (owner: 10Hoo man) [11:57:11] Amir1: can I be super annoying and eat into a little bit of your slot? [11:57:35] Sure, mine won't take that long (if everything goes right) [11:58:02] ooh, a new wiki! [11:58:09] Thanks! 
It's my fault for not instructing jenkins earlier [11:58:17] ah, so that’s why I was asked to review the Balinese translations of Scribunto ^^ [11:58:36] Urbanecm: should those translations be backported? [11:59:32] Lucas_WMDE: I _think_ it will survive the train without issues [12:00:04] Amir1: Your horoscope predicts another unfortunate Creating banwiki deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T1200). [12:00:13] tarrow: let me know when you're done [12:00:29] Amir1: Will do! Thanks again for your understanding [12:04:47] 10Operations, 10Puppet: Server volatile uri from local site. - https://phabricator.wikimedia.org/T235427 (10jbond) [12:07:10] 10Operations, 10Puppet: Server volatile uri from local site. - https://phabricator.wikimedia.org/T235427 (10jbond) [12:10:20] tarrow: it looks like Zuul is finally done? [12:10:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/542901 (https://phabricator.wikimedia.org/T235405) (owner: 10Muehlenhoff) [12:10:33] !log tarrow@deploy1001 Synchronized php-1.35.0-wmf.1/extensions/Wikibase: SWAT: [[gerrit:542894|Bump up Termbox cache version (T235192)]] (duration: 00m 56s) [12:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:37] T235192: Bump up Termbox Cache version - https://phabricator.wikimedia.org/T235192 [12:10:37] ah, ok :) [12:10:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/542905 (owner: 10Muehlenhoff) [12:10:50] Amir1: all done :) [12:11:06] coolio, taking over [12:11:11] Lucas_WMDE: yeah; I was just being a bit quiet [12:11:35] (03CR) 10Effie Mouzeli: "Looks as expected https://puppet-compiler.wmflabs.org/compiler1002/18867/" [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [12:12:24] (03PS5) 10Ladsgroup: Initial configuration for banwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [12:12:43] RECOVERY - DPKG on ununpentium is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:13:34] (03PS7) 10Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [12:13:43] (03PS8) 10Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [12:14:20] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [12:15:12] (03Merged) 10jenkins-bot: Initial configuration for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [12:15:39] (03PS6) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [12:16:14] 10Operations, 10Performance-Team, 10serviceops: webperf* running out of disk space - https://phabricator.wikimedia.org/T235425 (10Gilles) p:05Triage→03High [12:16:20] (03CR) 10Mathew.onipe: wdqs: add data-reload cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [12:16:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump the version in Chart.yaml as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/542897 (https://phabricator.wikimedia.org/T224721) (owner: 10KartikMistry) [12:19:00] (03PS15) 10Jbond: puppetmaster/configmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [12:19:59] !log 
ladsgroup@deploy1001 Synchronized dblists: Creating banwiki: T234768 (duration: 00m 52s) [12:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:03] T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768 [12:20:42] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [12:22:59] (03PS1) 10Ladsgroup: Add banwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542912 (https://phabricator.wikimedia.org/T234768) [12:23:46] (03CR) 10Muehlenhoff: [C: 03+2] Create a separate component/cergen [puppet] - 10https://gerrit.wikimedia.org/r/542901 (https://phabricator.wikimedia.org/T235405) (owner: 10Muehlenhoff) [12:24:33] (03CR) 10Muehlenhoff: [C: 03+2] Add package builder hook for cergen [puppet] - 10https://gerrit.wikimedia.org/r/542905 (owner: 10Muehlenhoff) [12:24:36] (03CR) 10Ladsgroup: [C: 03+2] Add banwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542912 (https://phabricator.wikimedia.org/T234768) (owner: 10Ladsgroup) [12:25:24] (03Merged) 10jenkins-bot: Add banwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542912 (https://phabricator.wikimedia.org/T234768) (owner: 10Ladsgroup) [12:28:51] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: Creating banwiki: T234768 [12:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:55] T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768 [12:30:09] (03CR) 10Jbond: [C: 03+1] "I updated the spec tests, fixtures and some other missed references. 
PCC and ci seem happy and looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [12:31:16] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating banwiki: T234768 (duration: 00m 51s) [12:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:43] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: Creating banwiki: T234768 (duration: 00m 51s) [12:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add package builder hook for cergen [puppet] - 10https://gerrit.wikimedia.org/r/542905 (owner: 10Muehlenhoff) [12:33:27] (03PS1) 10Ladsgroup: mediawiki: Disable query page updates for wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/542915 (https://phabricator.wikimedia.org/T234948) [12:34:34] !log ladsgroup@deploy1001 Synchronized langlist: Creating banwiki: T234768 (duration: 00m 50s) [12:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:38] T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768 [12:37:31] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542918 [12:37:33] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542918 (owner: 10Ladsgroup) [12:37:35] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542919 [12:37:37] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542919 (owner: 10Ladsgroup) [12:37:39] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542920 [12:37:41] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542920 (owner: 10Ladsgroup) [12:38:55] (03Abandoned) 10Ladsgroup: Update 
interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542919 (owner: 10Ladsgroup) [12:38:59] (03Abandoned) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542920 (owner: 10Ladsgroup) [12:39:04] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542918 (owner: 10Ladsgroup) [12:39:23] What the... [12:40:21] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 04s) [12:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:39] (03PS1) 10Jbond: puppetmaster::frontend serve volatile uri from the locale site forntend [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) [12:43:47] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (10Ladsgroup) >>! In T234770#5550592, @Marostegui wrote: > Let us know when the database is created so we can sanitize it on labs hosts Done now \o/ [12:43:59] 10Operations, 10Puppet, 10Patch-For-Review: Serve volatile uri from local site. 
- https://phabricator.wikimedia.org/T235427 (10jbond) [12:44:26] !log Creating banwiki is banned (done) [12:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:57] that was a particularly horrible pun [12:45:48] 10Puppet: missing CRL - https://phabricator.wikimedia.org/T235185 (10jbond) p:05Triage→03Normal [12:46:02] 10Puppet: reimage of puppet servers can fail - https://phabricator.wikimedia.org/T235067 (10jbond) p:05Triage→03Normal [12:47:23] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) [12:47:56] (03PS1) 10Jbond: config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 [12:48:22] (03CR) 10jerkins-bot: [V: 04-1] config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (owner: 10Jbond) [12:49:00] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) [12:50:34] Amir1, lol [12:50:46] Amir1, that means I can start importing now? 
:D [12:51:39] (03PS2) 10Jbond: config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) [12:52:05] (03CR) 10jerkins-bot: [V: 04-1] config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [12:52:15] (03PS1) 10Jbond: pybal_backend: enable codfw endpoint [puppet] - 10https://gerrit.wikimedia.org/r/542926 (https://phabricator.wikimedia.org/T234315) [12:52:19] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (10Marostegui) a:03Marostegui [12:53:20] (03PS3) 10Jbond: config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) [12:53:46] (03CR) 10jerkins-bot: [V: 04-1] config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [12:54:02] Jhs: yes [12:54:08] whee [12:54:49] (03PS4) 10Jbond: config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) [12:56:50] (03PS2) 10KartikMistry: Also add templatemapping to cxserver prod config [deployment-charts] - 10https://gerrit.wikimedia.org/r/542897 (https://phabricator.wikimedia.org/T224721) [12:58:53] (03PS1) 10Jbond: pupetmasters: remove local server config [puppet] - 10https://gerrit.wikimedia.org/r/542929 (https://phabricator.wikimedia.org/T234315) [12:58:59] Amir1: sorry, wasn't watching the process closely, any complications? [13:01:29] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Joe) >>! 
In T233654#5569671, @mobrovac wrote: > We have to resolve the same problem here to the one we encountered in Beta. Namely, both php-fpm and parsoid services use... [13:02:32] Amir1, the logo is still in English. I noticed that was missing in Urbanecm's InitialiseSettings.php update, so that's probably why? [13:02:49] Jhs: mea culpa, I'll fix that in a while [13:03:01] that's fine, no worries [13:03:38] I actually noticed it before, but thought it may be done some other way by the software. Now I know it doesn't. :) So if I notice it next time, I'll leave a comment in gerrit [13:05:48] thanks Jhs [13:05:51] (03PS1) 10Urbanecm: Add banwiki's logo to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542933 (https://phabricator.wikimedia.org/T234768) [13:06:04] (03CR) 10Urbanecm: [C: 03+2] Add banwiki's logo to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542933 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [13:06:20] (03CR) 10Lucas Werkmeister (WMDE): "Shouldn’t it be sorted before barwiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/542933 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [13:07:00] thanks Lucas_WMDE [13:07:03] (03PS2) 10Urbanecm: Add banwiki's logo to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542933 (https://phabricator.wikimedia.org/T234768) [13:07:18] (03CR) 10Urbanecm: [C: 03+2] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542933 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [13:07:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Also add templatemapping to cxserver prod config [deployment-charts] - 10https://gerrit.wikimedia.org/r/542897 (https://phabricator.wikimedia.org/T224721) (owner: 10KartikMistry) [13:08:05] just in time, apparently ^^ [13:08:14] thanks for fixing it btw [13:08:21] (03CR) 10Effie Mouzeli: [C: 03+2] lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) (owner: 10Effie Mouzeli) [13:08:23] Amir1: outstanding commits in /srv/mediawiki-stagging, could you fix 'em? [13:08:33] (03PS5) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) [13:08:44] in theory, can reset and re-run update iw cache, but...maybe you got a better solution? 
[13:09:09] (03Merged) 10jenkins-bot: Add banwiki's logo to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542933 (https://phabricator.wikimedia.org/T234768) (owner: 10Urbanecm) [13:09:11] (03PS3) 10Alexandros Kosiaris: Also add templatemapping to cxserver prod config [deployment-charts] - 10https://gerrit.wikimedia.org/r/542897 (https://phabricator.wikimedia.org/T224721) (owner: 10KartikMistry) [13:10:25] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Jclark-ctr) Replaced failed PSU [13:10:45] (03PS1) 10Alexandros Kosiaris: Publish cxserver-0.0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/542934 [13:10:50] hmm, nothing seems to be changed => running git reset --hard origin/master [13:10:53] !log Sanitize banwiki on db1124:3313 and db2094:3313 T234770 [13:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:58] T234770: Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 [13:11:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Publish cxserver-0.0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/542934 (owner: 10Alexandros Kosiaris) [13:11:27] (03Merged) 10jenkins-bot: Publish cxserver-0.0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/542934 (owner: 10Alexandros Kosiaris) [13:12:10] !log Run git reset --hard origin/master in /srv/mediawiki-stagging (deleted https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/542920 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/542919 from deployment srv, both don't actually change anything => safe to delete) (T234768) [13:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:14] T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768 [13:12:25] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - 
https://phabricator.wikimedia.org/T233654 (10mobrovac) Aaah indeed, you are right @Joe. Not sure where I was looking :/ Must have been logged into a Beta instance. Ok, we don't need to do this whole port-switching... [13:13:53] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 245b4e5: Add banwiki logo to IS.php (T234768) (duration: 00m 51s) [13:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:09] Jhs: should be fixed, could you check for me? [13:14:31] Urbanecm, yes. Thank you very much! [13:14:53] you're welcome [13:15:12] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Marostegui) Thanks John! The alert recovered: ` Sensor Type(s) Temperature, Power_Supply Status: OK This service is currently in a period of scheduled downtime View Extra Service Notes OK 20... [13:16:31] Urbanecm: banwiki creation done? I will deploy cxserver if yes. [13:16:43] kart_: yes, and my fixing patch is done as well [13:16:49] cool. [13:18:04] !log Disable puppet on mw* to remove php72_only feature flag - T229792 [13:18:06] Urbanecm: sorry, I'm afk. Is there anything I can help with now? [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:07] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [13:18:24] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [13:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:27] Amir1: figured out the commits are abandoned on gerrit now and changed only the comment, so I've deleted them for good [13:18:30] for details, see my log entry [13:19:44] (03PS1) 10Jbond: profile::base::puppet: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/542938 [13:19:49] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[13:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:53] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (10Marostegui) I have sanitized this wiki, but before adding the grants and creating the `_p` database I am running a check to make sure all the info... [13:20:05] (03PS2) 10Jbond: profile::base::puppet: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/542938 [13:21:41] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [13:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:05] Thanks [13:23:55] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Jclark-ctr) 05Open→03Resolved [13:23:57] 10Operations, 10ops-eqiad: Power issue in eqiad A1 - https://phabricator.wikimedia.org/T233248 (10Jclark-ctr) [13:23:59] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10Jclark-ctr) [13:25:26] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/542939 [13:25:33] (03PS2) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/542939 [13:25:41] (03PS1) 10Jbond: profile::base: remove puppetmaster parameter [puppet] - 10https://gerrit.wikimedia.org/r/542940 [13:26:11] !log imported python-networkx 1.11-2~wmf1 to component/cergen for buster-wikimedia T235405 [13:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:15] T235405: Build cergen for buster - https://phabricator.wikimedia.org/T235405 [13:28:28] (03PS9) 10Effie Mouzeli: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 
(https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [13:30:19] (03PS1) 10Muehlenhoff: Actually install pbuilder hook [puppet] - 10https://gerrit.wikimedia.org/r/542941 [13:31:30] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [13:32:33] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10MoritzMuehlenhoff) Thanks, could we try upgrading the BIOS/firmware initially on 2002? Maybe tomorrow (I'd prepare the server so that it can be taken down witho... [13:33:18] (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/compiler1002/18869/" [puppet] - 10https://gerrit.wikimedia.org/r/542940 (owner: 10Jbond) [13:34:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/542941 (owner: 10Muehlenhoff) [13:36:08] !log Slowly enable puppet on mw* canaries [13:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10WMDE-leszek) As an Engineering Manager at WMDE I endorse this request. 
[13:37:59] (03PS2) 10Muehlenhoff: Actually install pbuilder hook [puppet] - 10https://gerrit.wikimedia.org/r/542941 [13:39:36] (03CR) 10Muehlenhoff: [C: 03+2] Actually install pbuilder hook [puppet] - 10https://gerrit.wikimedia.org/r/542941 (owner: 10Muehlenhoff) [13:40:11] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/542939 (owner: 10Marostegui) [13:40:19] (03PS3) 10Marostegui: Revert "dbproxy1011: Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/542939 [13:42:18] (03PS1) 10Jbond: profile::base::puppet: ensure variables all exist in module namespace [puppet] - 10https://gerrit.wikimedia.org/r/542944 [13:42:47] !log Repool labsdb1009 after PSU replacement - T233273 [13:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:55] T233273: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 [13:46:30] (03CR) 10Andrew Bogott: "Removing this from prod is probably fine; note that that variable is used throughout wmcs (e.g. to determine if a VM is managed by a proje" [puppet] - 10https://gerrit.wikimedia.org/r/542940 (owner: 10Jbond) [13:47:33] (03PS1) 10Jbond: profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 [13:48:20] !log imported cergen 0.2.4-1+deb10u1 to component/cergen for buster-wikimedia T235405 [13:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:24] T235405: Build cergen for buster - https://phabricator.wikimedia.org/T235405 [13:48:57] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/542940 (owner: 10Jbond) [13:49:55] (03CR) 10jerkins-bot: [V: 04-1] profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 (owner: 10Jbond) [13:51:22] (03CR) 10Andrew Bogott: "Yep, I think that's right." 
[puppet] - 10https://gerrit.wikimedia.org/r/542940 (owner: 10Jbond) [14:03:09] (03PS2) 10Jbond: profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 [14:06:10] (03CR) 10jerkins-bot: [V: 04-1] profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 (owner: 10Jbond) [14:10:07] (03PS2) 10Jbond: profile::base::puppet: ensure variables all exist in module namespace [puppet] - 10https://gerrit.wikimedia.org/r/542944 [14:10:42] 10Operations, 10Citoid, 10Release Pipeline, 10Services, 10serviceops: Migrate citoid and zotero services to helm ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (10jijiki) p:05Triage→03Normal [14:10:45] (03PS3) 10Jbond: profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 [14:11:04] 10Operations, 10observability, 10Patch-For-Review: Hosts in puppet with $cluster missing from wikimedia_clusters - https://phabricator.wikimedia.org/T234232 (10jijiki) p:05Triage→03Normal [14:12:57] (03CR) 10jerkins-bot: [V: 04-1] profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 (owner: 10Jbond) [14:13:13] !log Enable puppet on mw* servers and reload apache - T229792 [14:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:18] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [14:14:51] (03PS4) 10Jbond: profile::base::puppet: switch profile top automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 [14:15:33] (03PS5) 10Jbond: profile::base::puppet: switch profile to automatic parameter binding [puppet] - 10https://gerrit.wikimedia.org/r/542945 [14:17:58] (03CR) 10jerkins-bot: [V: 04-1] profile::base::puppet: switch profile to automatic parameter binding [puppet] - 
10https://gerrit.wikimedia.org/r/542945 (owner: 10Jbond) [14:18:15] 10Operations, 10Puppet, 10Patch-For-Review: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10Reedy) [14:21:34] !log Deploy schema change on db1116:3317 T234066 T233135 [14:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:39] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [14:21:39] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [14:25:32] 10Operations, 10MediaWiki-Vagrant, 10phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (10jijiki) p:05Triage→03Normal [14:28:16] !log upload matomo 3.11 to stretch-wikimedia and upgrade matomo1001 - T234607 [14:28:16] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10jijiki) @CorinnaHillebrand_WMDE please sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document, as well as the NDA. Also, please provide your wiki... 
[14:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:20] T234607: Upgrade matomo to its latest upstream version - https://phabricator.wikimedia.org/T234607 [14:28:27] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10jijiki) p:05Triage→03Normal [14:28:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3314 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9329 and previous config saved to /var/cache/conftool/dbconfig/20191014-142843-marostegui.json [14:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:48] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [14:29:49] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jijiki) @ JKumalah is this task resolved? 
[14:29:58] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jijiki) p:05Triage→03Normal [14:32:07] (03PS1) 10Jbond: profile::pupetmaster::frontend: manage ca.pem used in apache config [puppet] - 10https://gerrit.wikimedia.org/r/542954 (https://phabricator.wikimedia.org/T234332) [14:32:35] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) [14:32:46] (03CR) 10jerkins-bot: [V: 04-1] profile::pupetmaster::frontend: manage ca.pem used in apache config [puppet] - 10https://gerrit.wikimedia.org/r/542954 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [14:34:20] (03PS6) 10Ayounsi: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [14:34:42] 10Operations, 10Traffic: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10jijiki) p:05Triage→03Normal [14:35:45] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10TechCom, 10Release-Engineering-Team (Development services): Expand Gerrit Manager permissions - https://phabricator.wikimedia.org/T234474 (10jijiki) p:05Triage→03Normal [14:35:59] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10jijiki) p:05Triage→03Normal [14:36:21] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10jijiki) p:05Triage→03Normal [14:36:55] (03PS2) 10Jbond: profile::pupetmaster::frontend: manage ca.pem used in apache config [puppet] - 10https://gerrit.wikimedia.org/r/542954 
(https://phabricator.wikimedia.org/T234332) [14:37:51] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/18875/" [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [14:38:11] (03CR) 10Ayounsi: profile:bird::anycast_healthchecker_monitoring: add python3-docopt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [14:38:40] 10Operations, 10Services, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 2 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10akosiaris) p:05Triage→03High [14:39:13] 10Operations, 10Services, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 2 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10akosiaris) [14:39:20] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) [14:40:38] (03PS1) 10Ladsgroup: mediawiki: Split cronjob for updatequerypages to multiple modules [puppet] - 10https://gerrit.wikimedia.org/r/542956 (https://phabricator.wikimedia.org/T234948) [14:41:34] 10Operations, 10Services, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 2 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10akosiaris) For what is worth, the poolcounter approach is probably the saner one long term. And per https://... [14:41:40] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) In the interest of splitting off from this task what is probably going to be so... 
[14:42:02] (03CR) 10Ayounsi: [C: 03+1] profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [14:43:38] (03CR) 10Elukey: "Completely forgot about this one, let's merge?" [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [14:45:47] 10Operations, 10Phabricator, 10Traffic: Access Forbidden to Phabricator at WikiArabia 2019 (Morocco) - https://phabricator.wikimedia.org/T234598 (10jijiki) p:05Triage→03High [14:46:37] (03PS2) 10Ladsgroup: mediawiki: Split cronjob for updatequerypages to multiple modules [puppet] - 10https://gerrit.wikimedia.org/r/542956 (https://phabricator.wikimedia.org/T234948) [14:50:12] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18877/mwmaint1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/542956 (https://phabricator.wikimedia.org/T234948) (owner: 10Ladsgroup) [14:52:41] 10Operations, 10conftool: remove service objects from etcd and update documentation - https://phabricator.wikimedia.org/T233973 (10Joe) p:05Triage→03Normal [14:54:24] (03PS1) 10Muehlenhoff: Add component/cergen to cergen class on buster [puppet] - 10https://gerrit.wikimedia.org/r/542960 [14:55:11] (03CR) 10jerkins-bot: [V: 04-1] Add component/cergen to cergen class on buster [puppet] - 10https://gerrit.wikimedia.org/r/542960 (owner: 10Muehlenhoff) [14:57:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [14:59:47] 10Operations, 10Puppet, 10Patch-For-Review: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10jijiki) p:05Triage→03Normal [15:02:03] 10Operations, 10Puppet, 10Patch-For-Review: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10MoritzMuehlenhoff) Adding @BBlack, @ema, @Vgutierrez for explicit input/signoff wrt the GeoIP sub directory. 
[15:03:06] (03CR) 10Muehlenhoff: "The patch looks fine to me, but let's wait for confirmation wrt the GeoIP sub directory, I've pinged Traffic people on task." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [15:03:53] (03CR) 10Jbond: [C: 03+2] config-master: move eqsin and eqiad config master back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/542924 (https://phabricator.wikimedia.org/T234315) (owner: 10Jbond) [15:05:11] (03PS2) 10Jbond: puppetmaster::frontend serve volatile uri from the locale site frontend [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) [15:05:30] (03PS2) 10Muehlenhoff: Add component/cergen to cergen class on buster [puppet] - 10https://gerrit.wikimedia.org/r/542960 [15:05:43] (03CR) 10Jbond: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [15:08:24] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) What next steps are necessary for me to be able to sign the NDA? [15:15:17] (03PS3) 10Muehlenhoff: Add component/cergen to cergen class on buster [puppet] - 10https://gerrit.wikimedia.org/r/542960 [15:17:29] 10Operations: Improve management of users/groups on servers in production - https://phabricator.wikimedia.org/T235161 (10jijiki) p:05Triage→03Normal [15:18:29] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10WMDE-leszek) @RStallman-legalteam Could you please send the NDA to @CorinnaHillebrand_WMDE (email address visible in https://tools.wmflabs.org/ldap/user/coh... 
[15:23:02] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:23:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/542960 (owner: 10Muehlenhoff) [15:25:02] 10Operations, 10Analytics: Add metadata to puppet about kerberos accounts - https://phabricator.wikimedia.org/T235418 (10JAllemandou) p:05Triage→03Normal [15:30:34] (03CR) 10Muehlenhoff: [C: 03+2] Add component/cergen to cergen class on buster [puppet] - 10https://gerrit.wikimedia.org/r/542960 (owner: 10Muehlenhoff) [15:32:14] 10Operations: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10jijiki) p:05Triage→03Normal [15:44:14] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:47:25] 10Operations, 10DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (10jijiki) p:05Triage→03Normal [15:47:50] 10Operations, 10Performance-Team, 10serviceops: webperf* running out of disk space - https://phabricator.wikimedia.org/T235425 (10Krinkle) My guesses: * higher sampling rate from php7-excimer (compared to hhvm/xenon) is creating significantly larger files than we projected for in T199853. * general increase... 
[15:48:28] 10Operations, 10Performance-Team, 10serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Krinkle) [15:48:31] 10Operations, 10Release-Engineering-Team, 10Scap, 10Wikimedia-General-or-Unknown, 10serviceops: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10jijiki) p:05Triage→03Normal a:03jijiki [15:49:00] 10Operations, 10Release-Engineering-Team, 10Scap, 10Wikimedia-General-or-Unknown, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10jijiki) [15:49:14] 10Operations, 10Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (10jijiki) p:05Triage→03Normal [15:49:35] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (10jijiki) p:05Triage→03High [15:49:51] 10Operations, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10jijiki) p:05Triage→03Normal [15:50:14] 10Operations, 10serviceops: Increase of varnish-be failed fetches error due to "http format error" - https://phabricator.wikimedia.org/T235254 (10jijiki) p:05Triage→03Normal [15:50:30] 10Operations: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10jijiki) p:05Triage→03Normal [15:53:15] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10jijiki) @mepps @Aklapper Which one should we use, environmental-sustainability@ or sustainability@ ? 
[15:57:10] !log imported cergen 0.2.4-1+deb10u2 to component/cergen for buster-wikimedia T235405 [15:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:14] T235405: Build cergen for buster - https://phabricator.wikimedia.org/T235405 [16:00:46] !log Password reset for Xaris333 (T235441) [16:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:57] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10jijiki) @nnikkhoui you need to create an account on https://wikitech.wikimedia.org and let us know the username. Please choose your wikitech username and shell name (uid) very wi... [16:02:10] 10Operations, 10LDAP-Access-Requests: LDAP membership for new employee Nikki Nikkhoui - https://phabricator.wikimedia.org/T235136 (10jijiki) p:05Triage→03Normal [16:05:54] (03PS1) 10Jbond: adduser: create module to manage /etc/adduser.conf [puppet] - 10https://gerrit.wikimedia.org/r/542983 (https://phabricator.wikimedia.org/T235162) [16:05:56] (03PS1) 10Jbond: profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) [16:07:17] (03CR) 10jerkins-bot: [V: 04-1] adduser: create module to manage /etc/adduser.conf [puppet] - 10https://gerrit.wikimedia.org/r/542983 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:07:47] !log imported cergen 0.2.4-1+deb10u3 to component/cergen for buster-wikimedia T235405 [16:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:51] T235405: Build cergen for buster - https://phabricator.wikimedia.org/T235405 [16:14:00] (03PS2) 10Jbond: profile::base: add adduser module to profile:base [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) [16:17:42] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is 
CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:19:10] OA~.~.~.~. [16:19:14] OA~.~.~. [16:19:25] jbond42: password? [16:19:32] (03PS1) 10Muehlenhoff: Install python3-lib2to3 on buster [puppet] - 10https://gerrit.wikimedia.org/r/542987 (https://phabricator.wikimedia.org/T235405) [16:19:49] no just trying to close a dead )or apprently) not so dead session [16:19:57] xd [16:20:58] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10Papaul) Yes we can do this tomorrow I will ping you once on site tomorrow. [16:28:18] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:28:24] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah) @jijiki yes it is [16:31:18] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jijiki) 05Open→03Resolved a:03jijiki [16:40:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Analytics Access for Grant - https://phabricator.wikimedia.org/T235260 (10Nuria) [16:41:46] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Analytics Access for Grant - https://phabricator.wikimedia.org/T235260 (10Nuria) @gsingers In order to get an ldap user you need to create a user at http://wikitech.wikimedia.org , paste that user in this ticket once you have it, someo... 
[16:45:40] (03PS1) 10Elukey: Add upstream patch to avoid segfaults on Debian Stretch [debs/memkeys] (debian) - 10https://gerrit.wikimedia.org/r/542992 (https://phabricator.wikimedia.org/T223863) [16:47:58] (03CR) 10Muehlenhoff: [C: 03+2] Install python3-lib2to3 on buster [puppet] - 10https://gerrit.wikimedia.org/r/542987 (https://phabricator.wikimedia.org/T235405) (owner: 10Muehlenhoff) [16:52:42] (03CR) 10Muehlenhoff: profile::base: add adduser module to profile:base (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [17:00:04] gehel and onimisionipe: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T1700). [17:00:11] jouncebot: here here [17:01:43] 10Operations, 10Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (10MoritzMuehlenhoff) networkx has some breaking API changes between 1.x and 2.x which are non-trivial to resolve. To unbreak the use of cergen on buster the build has been adapted to use a forward-ported 1.11... 
[17:01:47] \o/ [17:04:27] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@217cac5]: New blazegraph build and GUI updates [17:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:16] (03PS2) 10Muehlenhoff: Update late.sh for hosts no longer using puppet 4 client packages [puppet] - 10https://gerrit.wikimedia.org/r/542131 [17:15:09] (03CR) 10Muehlenhoff: [C: 03+2] Update late.sh for hosts no longer using puppet 4 client packages [puppet] - 10https://gerrit.wikimedia.org/r/542131 (owner: 10Muehlenhoff) [17:21:13] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@217cac5]: New blazegraph build and GUI updates (duration: 16m 45s) [17:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:56] (03PS2) 10Elukey: Add upstream patch to avoid segfaults on Debian Stretch [debs/memkeys] (debian) - 10https://gerrit.wikimedia.org/r/542992 (https://phabricator.wikimedia.org/T223863) [17:41:53] 10Operations, 10Performance-Team, 10serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Dzahn) looking at them now i see they are only using 14% and 8% of / . I ran "apt-get clean" and now it's down to 12% and 6%. Alerting would be at 95% by defa... [17:44:50] 10Operations, 10Performance-Team, 10serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10elukey) I noticed the warning this morning as well and pinged Gilles on IRC, the issue seems to be in the /srv partition: ` /dev/vdb 147G 135G 4.4G... [17:53:39] 10Operations, 10Performance-Team, 10serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Dzahn) sorry, i was on 1001 and 2001 vs. 1002 and 2002 and was wondering why i don't even see /srv mounted on a separate device. yes, ACK. on 1002 / 2002 it... 
[17:55:56] RECOVERY - Disk space on webperf2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf2002&var-datasource=codfw+prometheus/ops [17:56:12] !log webperf2002 - /srv/xenon/logs/daily# gzip 2019-09*excimer*.log (T235425) [17:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:16] T235425: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 [17:58:43] (03PS1) 10Ema: ATS: set inbound_tls_settings for labs [puppet] - 10https://gerrit.wikimedia.org/r/542994 (https://phabricator.wikimedia.org/T234256) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T1800). Please do the needful. [18:00:04] duesen: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:44] (03CR) 10Ema: [C: 03+2] ATS: set inbound_tls_settings for labs [puppet] - 10https://gerrit.wikimedia.org/r/542994 (https://phabricator.wikimedia.org/T234256) (owner: 10Ema) [18:00:50] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [18:08:54] 10Operations, 10Traffic, 10Patch-For-Review: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10ema) Thanks for filing this task @Andrew! I did add `profile::trafficserver::tls::inbound_tls_settings` to `... 
[18:10:14] 10Operations, 10Puppet, 10Traffic, 10serviceops: Puppet systemd::mask is an anti pattern that has unwanted side effect - https://phabricator.wikimedia.org/T233839 (10ema) 05Open→03Stalled [18:20:30] (03PS1) 10Ema: ATS: lower compress plugin minimum-content-length [puppet] - 10https://gerrit.wikimedia.org/r/542996 (https://phabricator.wikimedia.org/T232615) [18:24:31] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [18:28:19] (03CR) 10D3r1ck01: [C: 03+1] "This makes sense to me but I may be missing something." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [18:29:37] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) Hello Reuven and welcome to the team! this is your onboarding ticket. Let's start things with creating a Wikimedia Developer account for you. Please see https://wiki... [18:31:14] (03CR) 10DannyS712: "> This makes sense to me but I may be missing something." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [18:31:21] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [18:34:18] Uh, is swat happening? [18:34:36] Seems like I forgot the gerrit hash tag after putting my patch on the calendar. 
[18:34:38] silly me [18:35:21] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10wiki_willy) @Jclark-ctr @Marostegui - thanks guys [18:36:25] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [18:37:46] MaxSem, RoanKattouw, Niharika, Urbanecm: ping [18:37:51] 10Operations, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10mobrovac) a:03mobrovac [18:38:11] duesen: since it's a wmf holiday there is no swat [18:38:42] mobrovac: meh. why isn't that represented on the calendar page? [18:38:58] that's a good question [18:39:00] PEBKAC? [18:39:04] i guess [18:39:17] I'll tag it for tomorrow [18:39:27] what's the change you want to swat? [18:39:38] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/542963 [18:40:07] oooh [18:40:14] no no, that'll have to wait tomorrow [18:40:38] i'd recommend putting it for the eu slot [18:41:31] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [18:41:38] yes, will do [18:42:03] the cache corruption is nasty, there is no good way to purge that. 
[18:42:20] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) - added to maint-announce shared inbox / Google group - added to "Ops vendor maintenance" calendar and permissions [18:46:43] (03CR) 10D3r1ck01: [C: 03+1] "> Basically all TAs are either sysops or also autopatrolled, so there wouldn't be a big difference" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [19:17:20] mutante: please don't run the same gzip command on the other server [19:17:25] explanation incoming [19:18:02] duesen: it sounds serious enough to warrant an emergency deploy [19:18:21] Krinkle: i'm afraid it's too late but that was only on files from September and i'm already off the server again [19:18:22] Your call [19:19:49] mutante: these are processed by cronjobs and automated proceses, renaming them means they are no longer found and reacted to as if they were deleted, with artefacts auto-deleted. We just lost last months flame graphs I noticed, and analysis tools can no longer compare against september. [19:19:55] MaxSem: I can't tell how wide spread the effect is on wmf servers. as far as I know, it is only triggered by running maintenance scripts for the translate extension. it was cripling for TWN [19:20:00] the last part can be undone, the earlier part is harder. [19:20:13] MaxSem: but yes, it's bad, since it also affects the text loaded for editing. [19:20:21] EEK [19:20:22] for future reference, webperf1002 is more valuable in broken state with last months' files, thatn in working state with only this months files. [19:21:00] duesen: Do we run these scripts in production? [19:21:05] This needs documenting and you couldn't have known. No worries :) [19:21:05] MaxSem: I'd be happy to see it deployed asap, if you are happy to do it on a holiday :) [19:21:34] MaxSem: i did not think so, but apprarently we do. 
or there is something else going on which we haven't figured out yet [19:21:58] I can deploy, but I need to run afterwards so can't babysit it [19:22:10] oh - we don't run the export scripts, but we do run the fuzzy script, i think [19:22:26] that doesn't affact as many pages, but it does affect some [19:22:37] An easy plug would be to disable the scripts [19:22:59] if you know how to do that, and how to make sure they are enabled again, and how this is communicated to the community, and... [19:23:21] Krinkle: ok, got it. i certainly did not expect the auto-delete part. zipping and not moving them was the attempt to quick fix it until we fix logrotate. i won't touch them again. do you want me to unzip stuff? [19:24:20] I don't know. Right now I am concerned about data retention, I don't think we have backups, leave it be for now. [19:24:34] duesen: Tell me which scripts [19:24:41] These also are not acted on by logrotate. [19:24:47] It's application data from arclamp. [19:24:51] It just looks like log files. [19:25:06] MaxSem: any script in extension/Translate. I assume the one running on wmf is called fuzzy.php [19:25:27] ...and populateFuzyy.php [19:26:12] I see only characterEditStats.php in Puppet [19:27:37] Nikerabbit: you here? [19:27:44] I mean, I can undeploy Translate completely... :} [19:27:52] \o/ [19:28:15] lolz @ undeploy translate [19:28:40] can you also undeploy flow as well while you're at it [19:29:03] MaxSem: digging through the code, there may be ways to trigger the bad code path via special pages [19:32:32] MaxSem: ok, looks like i was wrong. it'S not used for page views, but can be triggered via special pages. [19:32:44] at least, I'm not sure it can't [19:32:59] i'll change the comment on the patch to reflect that [19:33:59] done [19:36:18] Krinkle: i checked bacula console but no webperf clients. 
sorry about it, it was meant to prevent data loss not create it [19:37:04] (03CR) 10DannyS712: "> > Basically all TAs are either sysops or also autopatrolled, so" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541057 (owner: 10DannyS712) [19:37:26] we could add it as a backup::host to change that [19:38:04] Yes, that is among the many things we need :) [19:38:26] in general I'd prefer webperf servers to be down and keep the data they have in the form they have it and not be able to ingest new data. [19:38:40] as "perf is not acount, what to do" fallback scenario [19:38:42] around* [19:40:51] MaxSem: now what? [19:41:10] ok. if it happens again we i would just leave it completely hands off or we can shut them down [19:41:44] we can leave that kind of comment in the "runbook" link [19:41:55] Honestly, deploying a performance degrading patch on a holiday sounds scary [19:42:37] so if it alerts people would then get the link, also on IRC [19:43:12] 10Operations, 10Performance-Team, 10serviceops: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (10Krinkle) @dzahn These are not typical "log" files, they are active application data for arc lamp, public on perf.wm.o and consumed or reacted to by automated... [19:43:35] mutante: Yeah, I don't think we have a dedicated alert for this one yet? [19:43:35] 10Operations, 10Traffic: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10ema) The ability to reload global lua scripts has been added in 9.x only: https://github.com/apache/trafficserver/commit/e6147753cd65c3edd32b365e09b4d65edcffdd01 This explains why reloading a... [19:44:00] Our runbook for webperf*002 is https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-tools_services [19:44:07] but it does not cover disk space scenario [19:44:27] Krinkle, mutante: Any opinions re emergency deploy? ^ [19:44:45] Krinkle: yea, true. 
that's because that is a default check added in "base" and not the specific role [19:44:56] MaxSem: TLDR? [19:45:18] Krinkle: the patch we discussed [19:45:28] Daniel wants to deply https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/542328/ [19:45:31] MaxSem suggested an emergency deploy [19:45:55] but? [19:45:56] Kills caching, not sure how safe it is while lots of ppl are away [19:46:14] fwiw, this didn't have caching before [19:46:30] Translate used to hit the database every time for this [19:46:48] caching for bulk operations is new [19:46:50] The method didn't exist before so that's not a fair comprison. What did its new consumers do previously? [19:47:33] https://codesearch.wmflabs.org/deployed/?q=%5CbgetBlobBatch%5Cb&i=nope&files=&repos= [19:47:35] used in revisionStore only [19:47:46] revisionStore getSlotRowsForBatch is private [19:48:04] Krinkle: the new consumers previously hit the text table, uncached [19:48:25] Too many layers of indirection in this code, cannot easily reason about impact. [19:48:25] the consumer being a couple of placesin the transate extension [19:48:38] i can't even tell where it is used with confidence. [19:48:56] i can, up to MessageCollection. [19:49:02] but further then that, i don't know [19:49:11] but i know that MessageCOllection didn't do caching before [19:49:22] the private method is exposed in newRevisionsFromBatch RevStore method and getContentBlobsForBatch public methods [19:49:24] those two [19:49:32] https://codesearch.wmflabs.org/search/?q=%5Cb(newRevisionsFromBatch%7CgetContentBlobsForBatch)%5Cb&i=nope&files=&repos= [19:49:55] which are ref'ed in 6 different Translate extension files [19:50:38] If Lang-Eng says all these are only used outside prod and/or in maintenance scripts [19:50:54] that would be signal for me that it worth the bet that it won't cause issues. [19:50:57] have we got that signal? [19:52:35] nope. I *thought* I was sure about that, but I'm not. 
It's possible that it can be triggered via a special page. [19:52:50] but as i said - it wouldn't be worse than before the introduction of these methods. [19:53:57] duesen: OK, so without that signal and risk reduction, what it the other side - how big is the impact of not fixing it? [19:54:16] It seems the severity of the impact and the low-ness of the risk are tightly linked in a way that no action might make sense. [19:55:21] also, would it be feasible to replace with a getBlob loop that doesn't do batching but keeps caching? [19:55:25] under certain circumstabnces, the wrong revisions (and wrong page's) wikitext gets cached, and will be used instead of the real text. for everything, including editing [19:55:39] you mean by non-batch consumers? [19:55:55] it duplicted the code but uses the same cache key [19:55:55] this only happens when the preemptive refresh thing goes wrong, and only when the batch code is triggered. [19:56:05] i have no idea how often that happens in prod [19:56:06] ok, that's pretty bad yes. [19:56:20] same cache key, yes [19:56:35] Scribuntu module source code is read from this right? And tempalte transclusion, and gadget JS files etc. [19:57:00] Krinkle: *everything* is read from this cache. it'S the low level blob store cache [19:57:11] but the problem only affects pages managed by ttranslate [19:57:23] affects or is triggered b y [19:57:36] as far as i can tell, the confusion can only happen *between* pages managed by translate [19:57:44] I see [19:57:45] which may include templates. but not scripts, hopefully [19:58:13] and the reason we can't revert the blobstore patch is because Translate depends on the new method [19:58:38] technically, we can. translate does feature detection based on the method name. [19:58:44] (03PS1) 10Dzahn: webperf: add backups for arclamp application data [puppet] - 10https://gerrit.wikimedia.org/r/543005 [19:58:54] cool [19:59:01] That might be the easiest then? 
[19:59:08] touches more code [19:59:19] in my mind, that's actually more risky [19:59:35] also more than one patch to revert [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T2000). [20:00:38] Krinkle: or we emergency-deploy the *real* fix :) [20:05:38] (03PS2) 10Dzahn: webperf: add backups for arclamp application data [puppet] - 10https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425) [20:08:30] *sigh* [20:08:31] so [20:08:35] MaxSem: still around? [20:08:40] duesen: ok, current patch intent is fine, let me CR [20:10:33] ... [20:10:54] duesen: why no longer assertSame()? [20:11:01] did the return value change? [20:11:26] Krinkle: the patch against master has a CR-2 by MaxSem at the moment. so this patch ouwld have to go into the permanent-patch-mechanism, unless the master patch gets merged [20:11:50] It doesn't anymore [20:11:52] Krinkle: the order changed. isn't rellevant for an assoc array, but assertSame checks it [20:12:06] duesen: uh what? [20:12:10] MaxSem: oh ok [20:12:20] but jenkis hasn't picked it up again. [20:12:21] grr [20:12:33] Nikerabbit: We wanna undeploy Translate [20:12:41] (/s) [20:12:51] * duesen slaps MaxSem with a large trout [20:13:38] Nikerabbit: since the caching bug affects wmf too, we were trying to find out whetehr this code can be trigegred via web requests. i thought it was only maintenance scripts, but now i think that's not true [20:13:56] MessageCollection::filter is used by special pages, right? [20:14:18] duesen: Special:Translate uses MessageCollection [20:14:32] so def it can be triggered during web requests [20:14:36] right. i suppose that is why we also see cache corruption on wmf servers [20:14:51] and i see no good way to purge the cache in production. too much collateral... [20:16:09] Nikerabbit: thanks! 
[20:16:13] Krinkle: --^ [20:18:05] only wikis with Translate enabled are affected right? mw.org, meta, commons? We can probably purge those fairly rapidly in Varnish if needed. [20:18:12] Purging the ParserCache is trickier though [20:18:18] and also required. [20:19:12] on translatewiki.net I purged SqlBlobStore, diffs and parser cache entries [20:19:21] Krinkle: it's not the html output that is problematic. it'S the internal cache of the wikitext blobs. and diffs. [20:19:49] duesen: sure, but varnish and parser output is where that poisoned data ended up used, no? [20:19:57] so link tables are also wrong potentially [20:20:31] Krinkle: if the blob cache doesn't get purged first, everythign else will be poisoned again [20:20:47] yes, links tables may be affected as well [20:20:49] right, sure, we *also* need to purge the source [20:20:57] k [20:21:03] I'll update the core patch re tests [20:21:19] k [20:21:25] can we leave the prod patch? [20:21:52] yes [20:22:15] MaxSem: are you doing deploy window? [20:22:34] note that there is no new expected order, the order is underfined per the method's contract [20:22:34] * Krinkle sees some crossed out entries [20:22:37] Not window, I asked whether an emergency deploy is needed [20:22:49] Today's a holiday in the US [20:23:02] duesen: sure, but it's deterministic seems trivial to just codify, or call ksort() explicitly [20:23:07] less indirection is better imho [20:23:14] Would've been much less worry on a work day [20:23:34] ok [20:23:40] anyway, yeah, go ahead. [20:23:56] the purging maybe not today though [20:24:43] Nikerabbit: how did you purge SqlBlobStore - I assume you mean from memcached [20:25:11] today is a US holiday, tomorrow is national burger day in my street. there's always a party somewhere. [20:25:24] heh [20:25:43] Question is: who's gonna monitor performance? [20:26:06] I need to go soon [20:26:13] the use cases aren't perf sensitive and didn't have caching until recently, It's fine. 
[20:26:52] but I'll take a look in the hour after deploy [20:27:39] Krinkle: https://phabricator.wikimedia.org/T235188#5572985 [20:28:27] he, okay, neat. I won't use that in prod though, but SRE might. [20:28:28] Thanks :) [20:30:37] Krinkle, Nikerabbit: I updated the test cases for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/542328 [20:30:41] if you want to +2... [20:32:37] MaxSem, Nikerabbi, Krinkle: thank you all for the late night help with this! [20:32:47] duesen: Pulled on mwdebug1002, please test [20:34:07] MaxSem: I don't hape permissions to use Special:Translate on meta [20:34:20] Nikerabbit: ^^^ [20:34:22] Nikerabbit: can you give a quick check that this isn'totally broken [20:34:28] and the "changed" filter still works? [20:34:38] on mwdebug1002, that is. [20:34:46] ...so we don't pollute the cache even more [20:35:09] duesen: Special:Translate works as far as I can see [20:35:19] looks fine to me as well on mwdebug1002 [20:35:30] ok [20:35:36] there is no way to actually test for the bug [20:35:39] it's stochastic [20:35:47] but at least we didn't break the site :) [20:36:05] Well, we haven't deployed yet either ;) [20:36:07] wait Special:Translate lets me translate wiki articles!?! :] [20:36:41] hashar: they are called pages, but yes :) [20:37:13] It also blocks vandals and makes you coffee [20:37:35] Nikerabbit: btw while digging through the code i realized that the "chanegd" filter could be made to work based on sha1. just in case you want to fiddle with it ;) [20:37:38] OK, so I'm good to push? 
[20:37:51] seems like it [20:38:13] dont forget to check logstash after a few minutes [20:38:57] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [20:39:10] duesen: I thought we found out that nothing uses the "changed" filter currently (though it's exposed in the API) [20:39:37] not exactly error-free [20:40:03] it usually contains 0 mw log entries when testing patches as most code paths emit 0 warnings for simple cases [20:40:17] There's a few DPerf violations in Echo and Translate that are known and can be ignored [20:40:33] "Deferred update AtomicSectionUpdate_MediaWiki\Storage\PageUpdater::getAtomicSectionUpdate failed: Main slot of revision 19459753 not found in database!" [20:40:36] looks odd though [20:40:38] is that normal? [20:40:48] Krinkle: yes, that's the other bug [20:40:54] k [20:41:05] https://phabricator.wikimedia.org/T235027 this one [20:41:12] Use of Revision::getRevisionText was deprecated in MediaWiki 1.32. [Called from ThinMessage::translation in /srv/mediawiki/php-1.35.0-wmf.1/extensions/Translate/Message.php at line 170] [20:41:14] huh? [20:41:20] I thought we killed thouse? [20:41:27] duesen: yes the same bug I just linked [20:41:31] !log maxsem@deploy1001 Synchronized php-1.35.0-wmf.1/includes/: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/542963/ (duration: 00m 55s) [20:41:32] oh ok [20:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:41] ah, because of the fallback. right [20:42:07] Please test prod, ladies and gents [20:42:08] * hashar vanishes [20:42:26] * Nikerabbit goes to sleep /s [20:42:58] for translate MaxSem ? 
[20:43:42] it worked, apparently: https://meta.wikimedia.org/w/index.php?title=Help:Two-factor_authentication/es&diff=prev&oldid=19459768 [20:44:28] Weeee [20:44:34] hauskater: we didn't fix that bug, but great :D [20:44:47] I just did a null edit [20:44:58] and my previous translation updates were published [20:45:09] Yes, that's the workaround for https://phabricator.wikimedia.org/T235027 [20:45:10] if you let me know what do you exactly want to be tested I may assist [20:45:39] yesterday even null-edit-ing didn't worked so it's something :) [20:45:53] Well we would like to be sure that there is no more cache corruption, but excluding that we need to ensure site is still up [20:46:11] (03PS1) 10Daniel Kinzler: Set testwiki to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543007 (https://phabricator.wikimedia.org/T198558) [20:46:54] * hauskater translate a new unit [20:49:33] new translation unit: https://meta.wikimedia.org/wiki/Translations:Help:Two-factor_authentication/50/es but did not arrived to https://meta.wikimedia.org/wiki/Help:Two-factor_authentication/es Nikerabbit [20:49:55] not sure if that was what you were looking for when testing [20:50:03] hauskater: yeah, this patch wasn't supposed to fix that [20:50:13] ah, alright then [20:52:03] 10Operations, 10ops-codfw, 10Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez thanks for the update. The plan was for us to decommission one lvs in rack A2 and B2 so I can use the existing cables to setup lvs2007 and lvs2008. Both lvs2007 and 2008 ar... [20:53:20] hauskater: patches for that issue come tomorrow with train, I believe [20:54:39] hauskater: bad values will persist for 7 days. longer if they made it into the ParserCache, but there they can be fixed by purging the page [20:54:53] ?action=purge duesen ? [20:55:08] what about running translate/scripts/update-translatable-pages ? 
[20:55:15] eventually, yes - but only after the internal cache has expired, or has been purged manually [20:55:21] that can't be done over the webn [20:55:43] s/update/refresh [20:55:50] Yeah, I will run those scripts, but only once it is safe [20:56:05] I cannot thank you enough [20:56:05] yea, these scripts also rely on the blob cache [20:56:18] and run as www-user iirc [20:56:20] if the blob cache is polluted, all that depends on it will produce bad data [20:56:27] running these scripts now would make things worse [20:56:46] Noted [21:00:04] Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T2100). [21:05:36] (03CR) 10Groceryheist: "Is there anything else I need to do here? I'm just waiting." [puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [21:08:23] (03CR) 10Reedy: "You made your patch on a weekend, today is a WMF holiday... :)" [puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [21:12:52] !log Delete misc arclamp/logs and arclamp/svgs data from between 2018 and and 2019-08 on webperf1002/webperf2002, T235425 [21:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:56] T235425: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 [21:15:03] (03CR) 10Groceryheist: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/542621 (owner: 10Groceryheist) [21:16:18] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:16:37] Reedy: which holliday is today? [21:16:47] Random WMF one [21:17:06] no federal holiday then? 
[21:17:15] >This day holds a lot of weight for many in our staff and community as a variety of commemorations and celebrations occur, such as Indigenous Peoples' Day, Mother’s Day in Malawi, Thanksgiving in Canada and National Day in China. [21:17:16] Nope [21:17:27] https://en.wikipedia.org/wiki/Columbus_Day#Local_observance_of_Columbus_Day [21:18:18] ack [21:26:54] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:47:54] !log Deleting 2019-09-01––2019-09-10 arclamp logs on webperf2002, and decompress the rest of 2019-09, T235425 [21:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:58] T235425: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 [21:49:08] Operations, Arc-Lamp, Performance-Team, serviceops, Patch-For-Review: webperf*002 running out of disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 (Krinkle) [22:25:20] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (EYener) Hi @Nuria and team, is there anything needed from my end on this ticket? [22:27:16] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (Nuria) Approved on my end. [22:31:45] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (Nuria) @jrobell if the future to do this more smoothly I recommend the users that need access file the ticket themselves,... 
[22:42:06] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (Nuria) I think once change above gets merged @EYener should have access to both data and tools. [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191014T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:03:59] (PS3) Krinkle: Remove wmgReduceStartupExpiry (no longer used) [mediawiki-config] - https://gerrit.wikimedia.org/r/542642 (https://phabricator.wikimedia.org/T235314) [23:04:14] (CR) Krinkle: [C: +2] Remove wmgReduceStartupExpiry (no longer used) [mediawiki-config] - https://gerrit.wikimedia.org/r/542642 (https://phabricator.wikimedia.org/T235314) (owner: Krinkle) [23:05:03] (Merged) jenkins-bot: Remove wmgReduceStartupExpiry (no longer used) [mediawiki-config] - https://gerrit.wikimedia.org/r/542642 (https://phabricator.wikimedia.org/T235314) (owner: Krinkle) [23:05:12] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 7.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops [23:10:07] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 86f12b6e (duration: 00m 51s) [23:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:59] !log Delete 2019-09-01––2019-09-10 arclamp trace logs from webperf1002, and decompress the rest of 2019-09 (this will trigger svg re-generation), T235425 [23:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:03] T235425: webperf*002 running out of 
disk space (arc lamp, xhgui) - https://phabricator.wikimedia.org/T235425 [23:55:21] Operations, Traffic: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (Andrew) > Error while evaluating a Function Call, Could not find data item profile::trafficserver::tls::parent_rules If you don't... [23:55:45] Operations, Analytics, Analytics-Kanban, SRE-Access-Requests: Analytics Access for Grant - https://phabricator.wikimedia.org/T235260 (gsingers) My user is `Grant Ingersoll`